STRUCTURE-BASED CONSTRUCTION OF
HUMAN ANTIBODY LIBRARY 



BACKGROUND OF THE INVENTION 

Field of the Invention 

The present invention relates generally to computer-aided design
of human antibody sequence libraries and, more particularly, to
methods and systems for selecting and constructing fully human or
human-derived antibody libraries based on three-dimensional structural
frameworks of the vertebrate antibody repertoire.

Description of Related Art 

Antibodies are made by vertebrates in response to various
internal and external stimuli (antigens). Synthesized exclusively by B
cells, antibodies are produced in millions of forms, each with a different
amino acid sequence and a different binding site for antigen.
Collectively called immunoglobulins (abbreviated as Ig), they are among
the most abundant protein components in the blood, constituting about
20% of the total plasma protein by weight.

A naturally occurring antibody molecule consists of two identical
"light" (L) protein chains and two identical "heavy" (H) protein chains,
all held together covalently by precisely located disulfide linkages.
Chothia et al. (1985) J. Mol. Biol. 186:651-663; and Novotny and Haber
(1985) Proc. Natl. Acad. Sci. USA 82:4592-4596. The N-terminal
regions of the L and H chains together form the antigen recognition site
of each antibody.

The mammalian immune system has evolved unique genetic
mechanisms that enable it to generate an almost unlimited number of
different light and heavy chains in a remarkably economical way by
joining separate gene segments together before they are transcribed.
For each type of Ig chain (κ light chains, λ light chains, and heavy
chains) there is a separate pool of gene segments from which a single
peptide chain is eventually synthesized. Each pool is on a different
chromosome and usually contains a large number of gene segments
encoding the V region of an Ig chain and a smaller number of gene
segments encoding the C region. During B cell development a complete
coding sequence for each of the two Ig chains to be synthesized is
assembled by site-specific genetic recombination, bringing together the
entire coding sequence for a V region and the coding sequence for a C
region. In addition, the V region of a light chain is encoded by a DNA
sequence assembled from two gene segments: a V gene segment and a
short joining or J gene segment. The V region of a heavy chain is
encoded by a DNA sequence assembled from three gene segments: a V
gene segment, a J gene segment and a diversity or D segment.

The large number of inherited V, J and D gene segments
available for encoding Ig chains makes a substantial contribution on its
own to antibody diversity, but the combinatorial joining of these
segments greatly increases this contribution. Further, imprecise joining
of gene segments and somatic mutations introduced during the V-D-J
segment joining at the pre-B cell stage greatly increase the diversity of
the V regions.
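
The effect of combinatorial joining can be illustrated with simple
arithmetic. The sketch below multiplies the number of segments in each
pool; the segment counts are hypothetical values chosen only for
illustration and are not taken from this document, and the function
name is likewise an assumption.

```python
from math import prod


def combinatorial_diversity(segment_counts):
    """Number of distinct V regions obtainable by picking one segment per pool."""
    return prod(segment_counts.values())


# Hypothetical segment counts, for illustration only.
heavy = {"V": 40, "D": 25, "J": 6}
light = {"V": 40, "J": 5}

heavy_combos = combinatorial_diversity(heavy)  # 40 * 25 * 6
light_combos = combinatorial_diversity(light)  # 40 * 5
paired = heavy_combos * light_combos           # any heavy chain with any light chain

print(heavy_combos, light_combos, paired)
```

Imprecise joining and somatic mutation add further diversity on top of
this product and are not modeled here.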

After immunization against an antigen, a mammal goes through
a process known as affinity maturation to produce antibodies with
higher affinity toward the antigen. Such antigen-driven somatic
hypermutation fine-tunes antibody responses to a given antigen,
presumably due to the accumulation of point mutations specifically in
both heavy- and light-chain V region coding sequences and a selective
expansion of high-affinity antibody-bearing B cell clones.

Structurally, various functions of an antibody are confined to
discrete protein domains (regions). The sites that recognize and bind
antigen consist of three complementarity-determining regions (CDRs)
that lie within the variable (VH and VL) regions at the N-terminal ends of
the two H and two L chains. The constant domains are not involved
directly in binding the antibody to an antigen, but are involved in
various effector functions, such as participation of the antibody in
antibody-dependent cellular cytotoxicity.

The domains of natural light and heavy chains have the same
general structures, and each domain comprises four framework regions,
whose sequences are somewhat conserved, connected by three
hypervariable regions, or CDRs. The four framework regions largely
adopt a β-sheet conformation and the CDRs form loops connecting, and
in some cases forming part of, the β-sheet structure. The CDRs in each
chain are held in close proximity by the framework regions and, with
the CDRs from the other chain, contribute to the formation of the
antigen binding site.

Generally all antibodies adopt a characteristic "immunoglobulin
fold". Specifically, both the variable and constant domains of an
antigen binding fragment (Fab, consisting of VL and CL of the light chain
and VH and CH1 of the heavy chain) consist of two twisted antiparallel
β-sheets which form a β-sandwich structure. The constant regions have
three- and four-stranded β-sheets arranged in a Greek key-like motif,
while variable regions have a further two short β strands producing a
five-stranded β-sheet.

The VL and VH domains interact via the five-stranded β-sheets to
form a nine-stranded β barrel of about 8.4 Å radius, with the strands at
the domain interface inclined at approximately 50° to one another. The
domain pairing brings the CDR loops into close proximity. The CDRs
themselves form some 25% of the VL/VH domain interface.

The six CDRs (CDR-L1, -L2 and -L3 for the light chain, and
CDR-H1, -H2 and -H3 for the heavy chain) are supported on the β
barrel framework, forming the antigen binding site. While their
sequence is hypervariable in comparison with the rest of the
immunoglobulin structure, some of the loops show a relatively high
degree of both sequence and structural conservation. In particular,
CDR-L2 and CDR-H1 are highly conserved in conformation.

Chothia and co-workers have shown, by analysis of conserved key
residues, that five of the six CDR loops (all except CDR-H3) adopt a
discrete, limited number of main-chain conformations (termed canonical
structures of the CDRs). Chothia and Lesk (1987) J. Mol. Biol.
196:901-917; Chothia et al. (1989) Nature (London) 342:877; and
Chothia et al. (1998) J. Mol. Biol. 278:457-479. The adopted structure
depends on both the CDR length and the identity of certain key amino
acid residues, both in the CDR and in the contacting framework,
involved in its packing. The canonical conformations were determined
by specific packing, hydrogen bonding interactions, and stereochemical
constraints of only these key residues, which serve as structural
determinants.

Various methods have been developed for modeling the
three-dimensional structures of the antigen binding site of an antibody.
Other than x-ray crystallography, nuclear magnetic resonance (NMR)
spectroscopy has been used in combination with computer model
building to study the atomic details of antibody-ligand interactions.
Dwek et al. (1975) Eur. J. Biochem. 53:25-39. Dwek and coworkers
used spin-labeled hapten to deduce the combining site of the MoPC 315
myeloma protein for dinitrophenyl. Similar analysis has also been done
on anti-spin label monoclonal antibodies (Anglister et al. (1987) Biochem.
26: 6958-6064) and on the anti-2-phenyloxazolone Fv fragments
(McManus and Riechmann (1991) Biochem. 30:5851-5857).

Computer-implemented analysis and modeling of the antibody
combining site (or antigen binding site) is based on homology analysis
comparing the target antibody sequence with those of antibodies with
known structures or structural motifs in existing databases (e.g. the
Brookhaven Protein Data Bank). By using such homology modeling
methods, an approximate three-dimensional structure of the target
antibody is constructed. Early antibody modeling was based on the
conjecture that CDR loops with identical length and different sequence
may adopt similar conformations. Kabat and Wu (1972) Proc. Natl.
Acad. Sci. USA 69: 960-964. A typical segment match algorithm is as
follows: given a loop sequence, the Protein Data Bank can be searched
for short, homologous backbone fragments (e.g. tripeptides) which are
then assembled and computationally refined into a new combining site
model.
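
The search stage of such a segment match procedure can be sketched as
follows. This is a minimal illustration only: the fragment database,
the function names, and the identity threshold are all hypothetical,
and the assembly and refinement stages described above are not
implemented.

```python
def tripeptides(seq):
    """Return every overlapping three-residue window of a sequence."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]


def match_fragments(loop_seq, fragment_db, min_identical=2):
    """For each window of loop_seq, list database fragments that share
    at least min_identical residues at the same positions."""
    hits = {}
    for start, window in enumerate(tripeptides(loop_seq)):
        hits[start] = [
            frag_id
            for frag_id, frag_seq in fragment_db.items()
            if sum(a == b for a, b in zip(window, frag_seq)) >= min_identical
        ]
    return hits


# Hypothetical fragment database: identifier -> tripeptide sequence.
# A real search would also carry backbone coordinates for each fragment.
fragment_db = {"frag1": "GYT", "frag2": "YTF", "frag3": "AAA"}

hits = match_fragments("GYTFT", fragment_db)
print(hits)
```

Each window position maps to the candidate fragments that could supply
its backbone; windows with no hit would need a relaxed threshold or a
conformational search.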

More recently, the canonical loop concept has been incorporated
into computer-implemented structural modeling of the antibody
combining site. In its most general form, the canonical structure
concept assumes that (1) sequence variation at other than canonical
positions is irrelevant for loop conformation, (2) canonical loop
conformations are essentially independent of loop-loop interactions,
and (3) only a limited number of canonical motifs exist and these are
well represented in the database of currently known antibody crystal
structures. Based on this concept, Chothia predicted all six CDR loop
conformations in the lysozyme-binding antibody D1.3 and five canonical
loop conformations in four other antibodies. Chothia (1989), supra. It
is also possible to improve modeling of CDRs of antibody structures by
combining the homology modeling with conformational search
procedures. Martin, A.C.R. (1989) PNAS 86, 9268-72.

Besides modeling a specific antibody structure, efforts have been
made in generating artificial (or synthetic) libraries of antibodies which
are screened against one or more specific target antigens. Various
artificial sequences are generated by site-specific or random
mutagenesis of the antibody sequence, especially in the CDR regions
of the variable domains. For example, Iba et al. used a computer-driven
model building system to change the specificity of antibodies against
steroid antigens by introducing mutations into the CDR regions. Iba et
al. (1998) Protein Eng. 11:361-370. A phage-display library of Abs
against 17-α-hydroxyprogesterone (17-OHP) was generated in which 16
residues were mutated in three CDR regions of the heavy chain that
appeared to form the steroid-binding pocket. The phage display library
was screened against 17-OHP and cortisol that had been conjugated with
bovine serum albumin. Many clones were isolated that had retained
17-OHP-binding ability, as well as clones with the newly developed
ability to bind cortisol in addition to 17-OHP.

Phage display technology has been used extensively to generate
large libraries of antibody fragments by exploiting the capability of
bacteriophage to express and display biologically functional protein
molecules on its surface. Combinatorial libraries of antibodies have been
generated in bacteriophage lambda expression systems which may be
screened as bacteriophage plaques or as colonies of lysogens (Huse et
al. (1989) Science 246: 1275; Caton and Koprowski (1990) Proc. Natl.
Acad. Sci. (U.S.A.) 87: 6450; Mullinax et al. (1990) Proc. Natl. Acad. Sci.
(U.S.A.) 87: 8095; Persson et al. (1991) Proc. Natl. Acad. Sci. (U.S.A.) 88:
2432). Various embodiments of bacteriophage antibody display
libraries and lambda phage expression libraries have been described
(Kang et al. (1991) Proc. Natl. Acad. Sci. (U.S.A.) 88: 4363; Clackson et
al. (1991) Nature 352: 624; McCafferty et al. (1990) Nature 348: 552;
Burton et al. (1991) Proc. Natl. Acad. Sci. (U.S.A.) 88: 10134;
Hoogenboom et al. (1991) Nucleic Acids Res. 19: 4133; Chang et al.
(1991) J. Immunol. 147: 3610; Breitling et al. (1991) Gene 104: 147;
Marks et al. (1991) J. Mol. Biol. 222: 581; Barbas et al. (1992) Proc.
Natl. Acad. Sci. (U.S.A.) 89: 4457; Hawkins and Winter (1992) J.
Immunol. 22: 867; Marks et al. (1992) Biotechnology 10: 779; Marks et
al. (1992) J. Biol. Chem. 267: 16007; Lowman et al. (1991) Biochemistry
30: 10832; Lerner et al. (1992) Science 258: 1313). Also see review by
Rader, C. and Barbas, C. F. (1997) "Phage display of combinatorial
antibody libraries" Curr. Opin. Biotechnol. 8:503-508.

Generally, a phage library is created by inserting a library of
random oligonucleotides or a cDNA library encoding antibody fragments
such as VL and VH into gene 3 of M13 or fd phage. Each inserted gene is
expressed at the N-terminus of the gene 3 product, a minor coat protein
of the phage. As a result, peptide libraries that contain diverse peptides
can be constructed. The phage library is then affinity screened against
an immobilized target molecule of interest, such as an antigen, and
specifically bound phages are recovered and amplified by infection into
Escherichia coli host cells. Typically, the target molecule of interest,
such as a receptor (e.g., polypeptide, carbohydrate, glycoprotein, nucleic
acid), is immobilized (by covalent linkage to a chromatography resin, to
enrich for reactive phage by affinity chromatography) and/or labeled for
screening plaques or colony lifts. This procedure is called biopanning.
Finally, amplified phages can be sequenced for deduction of the specific
peptide sequences.

The sequences of the antibodies in these phage display libraries are
from natural sources. For example, cDNAs of antibody gene pools have
been generated from immunized or naive humans or rodents. Barbas
and Burton (1996) Trends Biotech. 14:230-234 (immunized donors); De
Haard et al. (1999) J. Biol. Chem. 274:18218-18230 (naive B-cell Ig
repertoires). The antibody cDNA library can be constructed by reverse
transcription of RNA encoding the gene pool from total RNA samples
isolated from B cells contained in peripheral blood supplied by a human
or animal. First strand cDNA synthesis is usually performed using the
method of Marks et al., in which a set of heavy and light chain cDNA
primers are designed to anneal to the constant regions for priming the
synthesis of cDNA of heavy chain and light chain (both κ and λ)
antibody genes in separate tubes. Marks et al. (1991) Eur. J. Immunol.
21:985-991.

Synthetic or artificial libraries of antibody sequences were
constructed in vitro from human germline sequences. Griffiths et al.
(1994) EMBO J. 13:3245-3260. Highly diverse repertoires of heavy and
light chains were created entirely in vitro from a bank of human V gene
segments and then, by recombination of the repertoire in bacteria, an
even larger (close to 6.5 x 10^10) synthetic library of Fab fragments was
generated in bacteria and displayed on filamentous phage.

Highly diverse synthetic libraries of antibody sequences were also
constructed based on consensus sequences of each germline family of
the human antibody repertoire. For example, a fully synthetic
combinatorial antibody library was constructed based on modular
consensus frameworks and CDR3 regions in heavy and light chains
randomized with trinucleotides. Knappik et al. (2000) J. Mol. Biol.
296:57-86. Knappik et al. analysed the human antibody repertoire in
terms of structure, amino acid sequence diversity and germline usage.
Modular consensus framework sequences with seven VH and seven VL
were derived to cover 95% of variable germline families. A consensus
sequence was derived for each highly used germline family and
optimized for expression in E. coli. Molecular modeling of the CDR
loops of the consensus sequences verified that all canonical classes
were present. Diversity of the antibody library was created by
replacing the CDR3 regions of the seven VH and seven VL frameworks of
the master genes with CDR3 library cassettes. A synthetic library of
combinatorial antibodies was generated from mixed trinucleotides and
biased towards natural human antibody CDR3 sequences. This library
was cloned into phagemid and expressed as soluble proteins in the
periplasm of E. coli.
SUMMARY OF THE INVENTION

The present invention provides a comprehensive methodology to
map the functional space of proteins by exploiting the fundamental
structure-sequence relationship within protein families. The
methodology of the present invention provides for efficient in silico
selection and construction of a library of antibodies with diverse
sequences. By using the methodology, libraries of antibodies can be
constructed with diverse sequences in the CDR regions and humanized
frameworks of the variable regions. The resulting antibodies are fully
human, human-derived, or of human origin (collectively referred to
herein as "human antibody") and are based on three-dimensional
structures of antibodies generated by all species of vertebrates
including human.

In one aspect of the invention, a method is provided for
constructing a library of artificial antibodies in silico based on
ensembles of 3D structures of existing antibodies of human origin,
optionally also including those of other vertebrate origins. By using the
method, a master library of human antibody sequences can be selected
to better represent the full range of antibody structures in the
vertebrate antibody repertoire that are functionally important for high
affinity binding to antigens and eliciting antibody-dependent cellular
responses.

Such a functionally representative library provides a structurally
diverse and yet functionally more relevant source of antibody
candidates which can then be screened for binding to a wide variety of
target molecules, including but not limited to biomacromolecules such
as proteins, peptides, and nucleic acids, and small molecules.

In one embodiment, the method comprises the steps of:
clustering variable regions of a collection of antibodies having
known 3D structures into at least two families of structural ensembles,
each family of structural ensemble comprising at least two different
antibody sequences but with substantially identical main chain
conformations;

selecting a representative structural template from each family of
structural ensemble;

profiling a tester polypeptide sequence onto the representative
structural template within each family of structural ensemble; and

selecting the tester polypeptide sequence that is compatible with the
structural constraints of the representative structural template.

According to the method, examples of the collection of antibodies
include, but are not limited to, antibodies or immunoglobulins collected
in a protein database such as the Protein Data Bank of Brookhaven
National Laboratory, GenBank at the National Institutes of Health, and
the Swiss-Prot protein sequence database.

The collection of antibodies having known 3D structures includes
any antibody having a resolved X-ray crystal structure, an NMR
structure, or a 3D structure based on structural modeling such as
homology modeling.

The variable regions of a collection of antibodies may be the full
length of the heavy chain or light chain variable region or a specific
portion of the heavy chain or light chain variable region, such as a CDR
(e.g., VH or VL CDR1, CDR2, and CDR3), a framework region (FR, e.g.,
VH or VL FR1, FR2, FR3, and FR4), and a combination thereof.

Also according to the method, the clustering step includes
clustering the collection of antibodies such that the root mean square
difference of the main chain conformations of antibody sequences in
each family of the structural ensemble is preferably less than 4 Å, more
preferably less than 3 Å, and most preferably less than 2 Å.

Optionally, the clustering step includes clustering the collection
of antibodies such that the Z-score of the main chain conformations of
antibody sequences in each family of the structural ensemble is
preferably more than 2, more preferably more than 3, and most
preferably more than 4.

The clustering step may be implemented by an algorithm
selected from the group consisting of CE, Monte Carlo and 3D
clustering algorithms.
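
The RMSD criterion above can be illustrated with a minimal sketch. It
is not the CE algorithm: it assumes the structures have already been
superimposed and aligned residue by residue, it uses a simple greedy
assignment as a stand-in, and it omits the Z-score test. The function
names and the toy coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square difference between two equal-length lists of
    (x, y, z) coordinates that are already superimposed."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


def cluster_by_rmsd(structures, cutoff=2.0):
    """Greedy clustering: each structure joins the first family whose
    representative (its first member) lies within the RMSD cutoff."""
    families = []
    for name, coords in structures.items():
        for family in families:
            if rmsd(coords, structures[family[0]]) < cutoff:
                family.append(name)
                break
        else:
            families.append([name])
    return families


# Toy main-chain coordinates in angstroms (hypothetical, three atoms each).
structures = {
    "A": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    "B": [(0.0, 0.5, 0.0), (1.0, 0.5, 0.0), (2.0, 0.5, 0.0)],
    "C": [(0.0, 0.0, 5.0), (1.0, 0.0, 5.0), (2.0, 0.0, 5.0)],
}

families = cluster_by_rmsd(structures, cutoff=2.0)
print(families)
```

Tightening the cutoff from 4 Å toward 2 Å yields more families with
more uniform main chain conformations; the first member of each family
serves here as a simple choice of representative template.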

Also according to the method, the profiling step includes reverse
threading the tester polypeptide sequence onto the representative
structural template within each family of structural ensemble.

Optionally, the profiling step is implemented by a multiple
sequence alignment algorithm such as the profile HMM algorithm and
PSI-BLAST (Position-Specific Iterated BLAST).

When the representative structural template is adopted by a CDR
region, the profiling step includes profiling the tester polypeptide
sequence that is a human or non-human antibody onto the
representative structural template within each family of structural
ensemble.

When the representative structural template is adopted by a FR
region, the profiling step includes profiling the tester polypeptide
sequence that is a human or non-human antibody, preferably a human
germline antibody sequence, onto the representative structural
template within each family of structural ensemble.

In another aspect of the invention, a method is provided for in
silico selection of antibody sequences based on structural alignment
with a target structural template. Diverse sequences which still retain
the same functionally relevant structure as the target structural
template can be selected by using reverse threading, the profile HMM
algorithm and PSI-BLAST. By using the method, a library of diverse
antibody sequences can be constructed and screened experimentally in
vitro or in vivo for antibody mutants with improved or desired
function(s).

In one embodiment, the method comprises the steps of:
providing a target structural template of a variable region of one
or more antibodies;

profiling a tester polypeptide sequence onto the target structural
template; and

selecting the tester polypeptide sequence that is structurally
compatible with the target structural template.

According to the method, the target structural template may be a
3D structure of a heavy chain or light chain variable region of an
antibody (e.g., CDR, FR and a combination thereof), or a structural
ensemble of heavy chain or light chain variable regions of at least two
different antibodies.

Also according to the method, the profiling step includes reverse
threading the tester polypeptide sequence onto the target structural
template.

Optionally, the profiling step is implemented by a multiple
sequence alignment algorithm such as the profile HMM algorithm and
PSI-BLAST.

Also optionally, when the target structural template is adopted
by a CDR region of the target antibody, the profiling step includes
profiling a heavy chain or light chain variable region of the tester
polypeptide sequence that is either a human antibody or a non-human
antibody.

Also optionally, when the target structural template is adopted
by a FR region of the target antibody, the profiling step includes
profiling a heavy chain or light chain variable region of the tester
polypeptide sequence that is a human antibody, preferably a human
germline antibody, onto the target structural template.

According to any of the above methods, the tester polypeptide
sequence may be the sequence or a segment sequence of an expressed
protein, preferably an antibody, more preferably a human antibody, and
most preferably a human germline antibody sequence.

According to any of the above methods, the selecting step includes
selecting the tester polypeptide sequence by using an energy scoring
function selected from the group consisting of electrostatic interactions,
van der Waals interactions, electrostatic solvation energy,
solvent-accessible surface solvation energy, and conformational entropy.

Optionally, the selecting step includes selecting the tester
polypeptide sequence by using a scoring function incorporating a
forcefield selected from the group consisting of the Amber forcefield,
the CHARMM forcefield, the Discover cvff forcefields, the ECEPP
forcefields, the GROMOS forcefields, the OPLS forcefields, the MMFF94
forcefield, the Tripos forcefield, the MM3 forcefield, the Dreiding
forcefield, and the UNRES forcefield, and other knowledge-based
statistical forcefields (mean field) and structure-based thermodynamic
potential functions.
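
As a rough illustration of energy-based selection, the sketch below
sums two generic pairwise terms, a Coulomb electrostatic term and a
12-6 Lennard-Jones van der Waals term, and ranks candidates by total
energy. It does not reproduce any of the forcefields named above: the
constant, the default parameters, the atom records, and the function
names are assumptions made for illustration, and solvation and
entropy terms are omitted.

```python
import math

# Approximate conversion constant in kcal*angstrom/(mol*e^2); assumed value.
COULOMB_CONSTANT = 332.0636


def coulomb_energy(q1, q2, r, dielectric=1.0):
    """Electrostatic energy of two point charges separated by r angstroms."""
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * r)


def lennard_jones_energy(r, epsilon, sigma):
    """12-6 Lennard-Jones energy with well depth epsilon and diameter sigma."""
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 * ratio6 - ratio6)


def pair_energy(atom_a, atom_b, epsilon=0.1, sigma=3.5):
    """Sum of both terms for one atom pair (illustrative parameters)."""
    r = math.dist(atom_a["xyz"], atom_b["xyz"])
    return (coulomb_energy(atom_a["q"], atom_b["q"], r)
            + lennard_jones_energy(r, epsilon, sigma))


def total_energy(atoms):
    """Sum pair energies over all unique atom pairs."""
    return sum(
        pair_energy(atoms[i], atoms[j])
        for i in range(len(atoms))
        for j in range(i + 1, len(atoms))
    )


def rank_candidates(candidates):
    """Order candidate names from lowest (most favorable) energy upward."""
    return sorted(candidates, key=lambda name: total_energy(candidates[name]))


# Two hypothetical two-atom candidates, 4 angstroms apart.
candidates = {
    "like_charges": [{"xyz": (0, 0, 0), "q": 1.0}, {"xyz": (4, 0, 0), "q": 1.0}],
    "opposite_charges": [{"xyz": (0, 0, 0), "q": 1.0}, {"xyz": (4, 0, 0), "q": -1.0}],
}

ranking = rank_candidates(candidates)
print(ranking)
```

In a real workflow each candidate would be a tester sequence threaded
onto the structural template, and the lowest-energy sequences would be
the ones retained.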

In yet another aspect of the invention, a method is provided for in
silico selection of antibody sequences based on homology alignment
with a target sequence template. Remote homologues with diverse
sequences but retaining the same functionally relevant structure can be
selected by using the profile hidden Markov model (HMM) and PSI-BLAST.
By using the method, a library of diverse antibody sequences can be
constructed with a relatively smaller size than that constructed by
complete randomization of the target sequence. This library can then be
filtered using a certain cutoff value based on, for example, the
occurrence frequency of variants in each amino acid residue position,
and screened experimentally in vitro or in vivo for antibody mutants
with improved or desired function(s).
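
The frequency-based filtering mentioned above can be sketched as
follows, assuming the selected sequences are already aligned and of
equal length. The function names, the example sequences, and the
cutoff values are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter


def position_frequencies(aligned):
    """Per-position residue frequencies for equal-length aligned sequences."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    return [
        {res: count / len(aligned) for res, count in Counter(column).items()}
        for column in zip(*aligned)
    ]


def filter_variants(aligned, cutoff):
    """Keep, at each position, the residues whose frequency meets the cutoff."""
    return [
        sorted(res for res, freq in column.items() if freq >= cutoff)
        for column in position_frequencies(aligned)
    ]


# Hypothetical aligned hits, three residues long.
aligned = ["ACD", "ACE", "AGD", "TCD"]

strict = filter_variants(aligned, cutoff=0.5)
loose = filter_variants(aligned, cutoff=0.25)
print(strict)
print(loose)
```

The library size is the product of the number of residues kept at each
position, so raising the cutoff shrinks the library quickly: one
sequence under the strict cutoff here versus eight under the loose one.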

In one embodiment, the method comprises the steps of:
providing a target sequence of a heavy chain or light chain
variable region of an antibody;

aligning the target sequence with a tester polypeptide sequence;
and

selecting the tester polypeptide sequence that has at least 15%
sequence homology with the target sequence.
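
The selecting step can be sketched as below. The sketch uses percent
identity over aligned, non-gap positions as a simple stand-in for
sequence homology, and it assumes the alignment has already been
produced by one of the alignment tools; the function names and the
example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over positions where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


def select_homologs(target, testers, min_identity=15.0):
    """Return names of tester sequences meeting the identity threshold."""
    return [
        name for name, seq in testers.items()
        if percent_identity(target, seq) >= min_identity
    ]


# Hypothetical pre-aligned sequences, ten positions each.
target = "ACDEFGHIKL"
testers = {
    "t1": "ACDEFGHIKL",  # identical to the target
    "t2": "ACDEAAAAAA",  # four matching positions
    "t3": "WWWWWWWWWL",  # one matching position
}

selected = select_homologs(target, testers, min_identity=15.0)
print(selected)
```

Raising `min_identity` to 25, 35 or 45 corresponds to the progressively
stricter preferred thresholds.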

According to the method, the target sequence may be the full
length of the heavy chain or light chain variable region, or a specific
portion of the variable region, such as a CDR, a framework (FR) region
and a combination thereof.

Also according to the method, the aligning step includes aligning
the target sequence with the polypeptide segment of the tester protein
by using a sequence alignment algorithm selected from the group
consisting of BLAST, PSI-BLAST, profile HMM, and COBLATH.

Also according to the method, when the target sequence is a CDR
region of the target antibody, the aligning step includes aligning any
protein sequence that is of either human or non-human origin with the
target sequence.

Also according to the method, when the target sequence is a CDR
region of the target antibody, the tester polypeptide sequence is a heavy
chain or light chain variable region of a human or non-human antibody.

Also according to the method, when the target sequence is a FR
region of the target antibody, the tester polypeptide sequence is a heavy
chain or light chain variable region of a human antibody, preferably a
human germline antibody sequence.

Also according to the method, the selecting step includes
selecting the polypeptide segment of the tester protein that has
preferably at least 25%, more preferably at least 35%, and most
preferably at least 45% sequence homology with the target sequence.

According to any of the above methods, the method further
comprises:

introducing the DNA segment encoding the selected tester
polypeptide into cells of a host organism;

expressing the DNA segment in the host cells such that a
recombinant antibody containing the selected polypeptide or antibody
sequence is produced in the cells of the host organism; and

selecting the recombinant antibody that binds to a target antigen
with affinity higher than 10^6 M^-1.
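
Binding data are often reported as a dissociation constant rather than
an association constant. Assuming the 10^6 M^-1 threshold is an
association constant (Ka), it corresponds to a dissociation constant
(Kd = 1/Ka) of 1 micromolar. The sketch below applies that conversion;
the function name and the example values are hypothetical.

```python
def passes_affinity_cutoff(kd_molar, min_ka=1e6):
    """True if the association constant (1/Kd, in M^-1) exceeds min_ka."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return (1.0 / kd_molar) > min_ka


# Hypothetical measurements: 10 nM and 10 uM dissociation constants.
tight_binder = passes_affinity_cutoff(1e-8)
weak_binder = passes_affinity_cutoff(1e-5)
print(tight_binder, weak_binder)
```

A 10 nM binder passes the cutoff, while a 10 micromolar binder does not.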

The recombinant antibody may be a fully assembled antibody, a 
Fab fragment, an Fv fragment, or a single chain antibody. 

The host organism includes any organism or its cell line that is
capable of expressing a transferred foreign genetic sequence, including
but not limited to bacteria, yeast, plants, insects, and mammals.

The target antigen to be screened against includes small
molecules and macromolecules such as proteins, peptides, nucleic
acids and polycarbohydrates.

BRIEF DESCRIPTION OF FIGURES

Figure 1 illustrates a flow chart of a computer-implemented process 
that can be used in the present invention to construct an antibody 
library in silico. 

Figure 2 shows the 7 VH and 7 VL consensus sequences for the 7 VH and
7 VL frameworks of the HuCAL library in FASTA format, by Knappik et
al., supra.

Figure 3 shows the structures of the seven VH sequences superimposed
on each other. The structures are aligned by superimposing the Cα
atoms using the CE algorithm with RMSD < 2 Å and Z-score > 4.

Figure 4 shows (A) the Cα trace of the superimposed structures of
3 VH sequences (1DHA in green, 1DHO in cyan, and 1DHW in
yellow); (B) the superimposed structures with a ribbon representation
of the β-sheets of the VH frameworks. As shown in both Figures 4A
and 4B, the 3 VH sequences (1DHA, 1DHO, and 1DHW) collapse into
one structural family with RMSD < 0.7 Å and Z-score > 6 using 1DHA as
standard, even though their sequence identity ranges widely from 72%
to 87% relative to 1DHA.

Figure 5 shows the structures of the seven VL sequences retrieved from
the PDB and superimposed on each other. The structures are aligned
by using 1DGX as the reference structure with RMSD < 1.6 Å and
Z-score > 6. The seven VL sequences have a wide range of conformational
variability, especially in the CDR regions highlighted (the structural
flexibility at the N- and C-termini is discarded here).



Figure 6 shows the superimposed 1DGX (green), 1DH4 (yellow), 1DH5
(cyan) and 1DH6 (magenta) with similar conformation but varying
length in the CDR regions. By using the CE algorithm, four VL
sequences (1DGX, 1DH4, 1DH5 and 1DH6) of the 7 consensus
sequence families can be clustered into one structural family with
RMSD < 0.6 Å and Z-score > 6 and with sequence identity ranging from
67% to 80% using 1DGX as the structure reference. These four
sequences also belong to the VL kappa sequence family.

Figure 7 shows three superimposed structures of 1DH7, 1DH8, and
1DH9 in the lambda variable light chain, which can be clustered into 1
structural family with RMSD < 1.5 Å and Z-score > 6 using 1DGX as the
reference.

Figure 8 shows in (A) that CDR1 regions of the three lambda (λ) VL
sequences (1DH7, 1DH8 and 1DH9) adopt similar conformations with
RMSD < 1 Å. (B) that CDR1 regions of the 4 kappa (κ) VL sequences
(1DH4, 1DH6, 1DGX and 1DH5) adopt similar conformations with
RMSD < 0.6 Å and gaps of 1-6 amino acids. (C) that CDR1 regions of
the two kappa (κ) VL sequences (1DH4 and 1DH6) adopt similar
conformations with RMSD < 0.6 Å and 1 amino acid gap in CDR1.
Thus, structures of these two kappa VL sequences are further clustered
into one structural family according to the present invention. (D) that
CDR1 regions of the two kappa (κ) VL sequences (1DGX and 1DH5)
adopt similar conformations with RMSD < 0.6 Å and 1 amino acid gap in
CDR1.

Figure 9 shows that clustering of the structures adopted by the seven
consensus germline VL sequences based on the structural families in
the CDR1 region led to two to three distinct families of antibody
structures: (1DH7, 1DH8 and 1DH9) for lambda variable light chains,
(1DH4 and 1DH6), and/or (1DGX and 1DH5) for kappa variable light
chains. The members within each family adopt similar conformations
in their CDR1 regions with varying length in amino acids.

Figure 10 shows the PDB IDs of the consensus sequences of VH and VL,
residues aligned, high score, P(N) sum, smallest probability, % identity
with the query sequence, and the germline family to which the identified
germline sequence belongs.

Figure 11 shows the homology alignment for each of the selected 
human antibody germline sequences with the query sequence. 


Figure 12A shows the flow chart for selecting the optimal remote 
homologous sequence(s) of structure-based multiple sequence 
alignment by using the profile Hidden Markov Model (HMM). 

Figure 12B shows results generated by using the method diagrammed in
Figure 12A targeting VH framework regions.

Figure 12C shows results generated by using the method diagrammed in
Figure 12A targeting kappa VL CDR1.

Figure 13 shows the top sequences from germline gene segments
selected using the profile HMM method for various re-clustered
structures.

DEFINITION

Structural family: a group of structures that are clustered into a family
based on empirically chosen cutoff values of the root mean square
deviation (RMSD) (for example, of the Cα atoms of the aligned residues)
and statistical significance (Z-score). These values are decided
empirically after an overall comparison among structures of interest.
For example, for the CE algorithm, the starting criteria used are
RMSD < 2 Å and Z-score > 4.
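
By way of a non-limiting illustration, the cutoff test described above can be sketched as follows. The sketch assumes that pairwise comparison results (RMSD and Z-score against a reference structure) have already been produced by a structure alignment program; the class name, function names and numerical values are hypothetical and are not taken from the CE program.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comparison:
    """One pairwise structural comparison against a reference structure."""
    pdb_id: str
    rmsd: float     # RMSD of aligned C-alpha atoms, in angstroms
    z_score: float  # statistical significance of the alignment


def same_family(c, max_rmsd=2.0, min_z=4.0):
    """Return True if the comparison satisfies both cutoffs."""
    return c.rmsd < max_rmsd and c.z_score > min_z


def family_members(comparisons, max_rmsd=2.0, min_z=4.0):
    """Return the IDs of structures that cluster with the reference."""
    return [c.pdb_id for c in comparisons if same_family(c, max_rmsd, min_z)]


# Hypothetical comparison values, for illustration only.
results = [
    Comparison("A", 0.5, 6.8),
    Comparison("B", 1.4, 6.1),
    Comparison("C", 2.6, 3.2),
]
print(family_members(results))                            # starting criteria
print(family_members(results, max_rmsd=0.6, min_z=6.0))   # tighter cutoffs
```

Because the cutoffs are parameters, the same comparison table may be re-clustered with tighter values once an overall comparison suggests them.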

Structural ensemble: It is well known that in structural
determination by NMR (nuclear magnetic resonance), an ensemble of
structures rather than a single structure, with perhaps several
members, all of which fit the NMR data and retain good
stereochemistry, is deposited with the Protein Data Bank (PDB).
Comparison between the models in this ensemble provides some
information on how well the protein conformation was determined by
the NMR constraints. In structural clustering, it is important to analyze
all members within a structural cluster to understand some
consensus information about the distribution of all structural templates
within a family and constraints on their sequences or sequence profiles
within a structural family. It should be pointed out that all
NMR-determined ensemble structures of a protein share the same
sequence (one protein with variable conformations). The
structural ensemble in the present invention refers to different
proteins with variations in sequence and/or length that have similar
main chain conformations.

Ensemble average or representative structure: if all members within a
structural cluster have the same length in amino acids, the positions of
the main chain atoms of all structures are averaged, and the
average model is then adjusted to obey normal bond distances and
angles ("restrained minimization"), similar to an NMR-determined
average structure. If the members within a structural cluster vary in
length, a member which is representative of the average
characteristics of all other members within the cluster will be chosen as
the representative structure.
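
By way of a non-limiting illustration, the two cases above may be sketched as follows. The sketch assumes that the member structures have already been superimposed, and it omits the restrained minimization step. Choosing the representative as the member with the lowest mean RMSD to all other members is an assumption made for this example only; the function names and data are hypothetical.

```python
def average_structure(members):
    """Average main-chain atom positions across superimposed members.

    members: list of structures, each a list of (x, y, z) tuples; all
    structures must have the same number of atoms.
    """
    n = len(members)
    length = len(members[0])
    return [
        tuple(sum(m[i][k] for m in members) / n for k in range(3))
        for i in range(length)
    ]


def representative(names, pairwise_rmsd):
    """Pick the member with the lowest mean RMSD to the other members."""
    def mean_rmsd(a):
        others = [b for b in names if b != a]
        return sum(pairwise_rmsd[frozenset((a, b))] for b in others) / len(others)
    return min(names, key=mean_rmsd)


# Hypothetical data, for illustration only.
same_length = [[(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
               [(2.0, 2.0, 0.0), (4.0, 0.0, 0.0)]]
print(average_structure(same_length))

rmsd = {frozenset(("a", "b")): 0.5,
        frozenset(("a", "c")): 0.6,
        frozenset(("b", "c")): 1.2}
print(representative(["a", "b", "c"], rmsd))
```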

Canonical structures: the commonly occurring main-chain
conformations of the hypervariable regions.

Structural repertoire: the collection of all structures populated by a
class of proteins such as the modular structures and canonical
structures observed for antibody framework and CDR regions.

Sequence repertoire: collection of sequences for a protein family.

Functional repertoire: a collection of all functions performed by
proteins, such as the antibodies' diverse functional epitopes that are
capable of binding to various antigens.

Germline gene segments: refers to the genes from the germline (the
haploid gametes and those diploid cells from which they are formed).
The germline DNA contains multiple gene segments that encode a single
immunoglobulin heavy or light chain. These gene segments are carried in
the germ cells but cannot be transcribed and translated into heavy and
light chains until they are arranged into functional genes. During B-cell
differentiation in the bone marrow, these gene segments are randomly
shuffled by a dynamic genetic system capable of generating more than
10^8 specificities. Most of these gene segments are published and
collected by the germline database.

Rearranged immunoglobulin sequences: the functional
immunoglobulin gene sequences in heavy and light chains that are
generated by transcribing and translating the germline gene segments
during the B-cell differentiation and maturation process. Most of the
rearranged immunoglobulin sequences used here are from the Kabat-Wu
database.

BLAST: Basic Local Alignment Search Tool for pairwise sequence
analysis. BLAST uses a heuristic algorithm with position-independent
scoring parameters to detect similarity between two sequences.

PSI-BLAST: The Position-Specific Iterated BLAST, or PSI-BLAST,
program performs an iterative search in which sequences found in one
round of searching are used to build a score model for the next round of
searching. In PSI-BLAST the algorithm is not tied to a specific score
matrix. Traditionally, BLAST has been implemented using an A×A
substitution matrix where A is the alphabet size. PSI-BLAST instead
uses a Q×A matrix, where Q is the length of the query sequence; at each
position the cost of a letter depends on the position with respect to the
query and the letter in the subject sequence. Two PSI-BLAST
parameters have been adjusted: the pseudocount constant default has
been changed from 10 to 7, and the E-value threshold for including
matches in the PSI-BLAST model has been changed from 0.001 to
0.002.
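
By way of a non-limiting illustration, the difference between an A×A substitution matrix and a Q×A position-specific matrix may be sketched as follows. The sketch is a simplified log-odds profile with a uniform background and a simple additive pseudocount. It is not the weighting scheme of the PSI-BLAST program, and its pseudocount parameter is not the PSI-BLAST pseudocount constant mentioned above. All names and sequences are hypothetical.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # A = 20 amino acid letters


def build_pssm(aligned, pseudocount=1.0):
    """Return a Q x A table of log-odds scores from aligned sequences.

    aligned: equal-length sequences found in one round of searching.
    """
    background = 1.0 / len(ALPHABET)
    pssm = []
    for column in zip(*aligned):
        total = len(column) + pseudocount * len(ALPHABET)
        row = {}
        for letter in ALPHABET:
            freq = (column.count(letter) + pseudocount) / total
            row[letter] = math.log2(freq / background)
        pssm.append(row)
    return pssm


def score(pssm, subject):
    """Score a subject sequence; each letter is scored by its position."""
    return sum(row[letter] for row, letter in zip(pssm, subject))


hits = ["ACDA", "ACDG", "ACEA"]  # hypothetical aligned hits
pssm = build_pssm(hits)
print(len(pssm), len(pssm[0]))        # Q rows by A columns
print(pssm[0]["A"] > pssm[3]["A"])    # same letter, different positions
```

The last line shows the position dependence: the letter A receives a higher score at the first position, where it is fully conserved among the hits, than at the fourth.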

COBLATH: A method that combines PSI-BLAST with a threading method
for fold recognition and query-template alignment. It may be used to
evaluate the compatibility between sequences and structural templates.

Profile Hidden Markov Models (profile HMMs): statistical models of the
primary structure consensus of a sequence family. They use
position-specific scores for amino acids and for opening and extending
an insertion or deletion to detect remote sequence homologues based on
the statistical description of the consensus of a multiple sequence
alignment. The multiple sequence alignment is given either by a
multiple sequence alignment program such as ClustalW or by a
structure-based multiple sequence alignment obtained from structural
clustering.

Threading: a process of assigning the folding of the protein by
threading its sequence to a library of potential structural templates by
using a scoring function that incorporates the sequence as well as the
local parameters such as secondary structure and solvent exposure.
The threading process starts from prediction of the secondary structure
of the amino acid sequence and solvent accessibility for each residue of
the query sequence. The resulting one-dimensional (1D) profile of the
predicted structure is threaded into each member of a library of known
3D structures. The optimal threading for each sequence-structure pair
is obtained using dynamic programming. The overall best
sequence-structure pair constitutes the predicted 3D structure for the
query sequence.
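
By way of a non-limiting illustration, the dynamic programming step may be sketched as follows. The sketch assumes that each template is reduced to a 1D string of environment classes ("b" for buried, "e" for exposed) and uses a toy scoring function that rewards hydrophobic residues at buried positions. The scoring function, gap penalty, template names and sequences are assumptions made for this example only.

```python
HYDROPHOBIC = set("AVLIMFWC")


def fit(residue, environment):
    """Toy score: +1 if the residue type suits the environment, else -1."""
    buried = environment == "b"
    return 1 if (residue in HYDROPHOBIC) == buried else -1


def thread(sequence, template, gap=-2):
    """Best global alignment score of a sequence onto a 1D template."""
    rows, cols = len(sequence) + 1, len(template) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = max(
                dp[i - 1][j - 1] + fit(sequence[i - 1], template[j - 1]),
                dp[i - 1][j] + gap,   # residue left unaligned
                dp[i][j - 1] + gap,   # template position skipped
            )
    return dp[-1][-1]


def best_template(sequence, library):
    """Return the name of the template with the best threading score."""
    return max(library, key=lambda name: thread(sequence, library[name]))


library = {"T1": "bebe", "T2": "eebb"}  # hypothetical 1D templates
print(best_template("LKVS", library))
```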



Reverse Threading: a process of searching for the optimal sequence(s)
from a sequence database by threading them onto a given target
structure and/or structure cluster. Various scoring functions may be
used to select for the optimal sequence(s) from the library comprising
protein sequences with various lengths.

Reverse Engineering: the procedure to select and construct sequences or
sequence libraries that are compatible with the structural constraints is
called reverse engineering, including but not limited to reverse
threading.

Supervariable Region of Antibody: regions within antibody CDRs that
show diverse structure, sequence and chain length variability compared
to the other regions of CDRs or CDR ensembles which are relatively
constant in structure, sequence and chain length. As exemplified in
Figure 12C, the super-variability of a region of a specific CDR family
can be exploited in CDR library construction.



DETAILED DESCRIPTION OF THE INVENTION



The present invention provides a system and method for efficient
in silico selection and construction of fully human and human-derived
antibody libraries. The process is carried out computationally (i.e., in
silico) in a high throughput manner by mining the ever-expanding
databases of protein sequences of all organisms, especially human. The
inventive methodology is developed by combining database mining of
evolutionary sequences from nature with computational design of
structurally relevant variants of the natural sequences.

In one aspect of the invention, the methodology is implemented
by a computer system which computationally selects those human
antibody sequences based on three-dimensional structural ensemble
and/or ensemble average represented by a limited, discrete number of
classes (or clusters) of antibody structures. By using the method, a
master library of human antibody sequences can be constructed to
better represent all antibody structures in the vertebrate antibody
repertoire that are functionally important for high affinity binding to a
large variety of antigens and eliciting antibody-dependent cellular
responses.

In another aspect of the invention, the methodology is
implemented by a computer system which computationally selects from
the protein databases protein sequences, particularly antibody
sequences, based on structural alignment with a target structural
template. Diverse sequences which still retain the same functionally
relevant structure as the target structural template can be selected by
using reverse threading. By using the method, a library of diverse
antibody sequences can be constructed and screened experimentally in
vitro or in vivo for antibody mutants with improved or desired
function(s).

In yet another aspect of the invention, the methodology is
implemented by a computer system which computationally selects from
the protein databases protein sequences, particularly antibody
sequences, based on homology alignment with a target sequence
template. Remote homologues with diverse sequences but retaining the
same functionally relevant structure can be selected by using
structure-based sequence alignment methods such as profile hidden
Markov Model (HMM). By using the method, a library of diverse antibody
sequences can be constructed with a relatively smaller size than that
constructed by complete randomization of the target sequence. This
library can then be thoroughly screened experimentally in vitro or in
vivo for antibody mutants with improved or desired function(s).
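
By way of a non-limiting illustration, the size difference between complete randomization and a restricted library may be worked out as follows. The region length of 10 residues and the number of allowed amino acids at each position are hypothetical values chosen for the example.

```python
from math import prod


def randomized_size(length, alphabet=20):
    """Library size when every position may hold any amino acid."""
    return alphabet ** length


def restricted_size(allowed_per_position):
    """Library size when each position holds only its allowed variants."""
    return prod(allowed_per_position)


# Hypothetical 10-residue region, for illustration only.
allowed = [3, 2, 4, 1, 2, 3, 2, 1, 2, 2]
print(randomized_size(len(allowed)))   # 20**10 = 10240000000000
print(restricted_size(allowed))        # 1152
```

Under these assumed numbers the restricted library is about ten orders of magnitude smaller, which is what makes thorough experimental screening practical.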

The inventive methodology can be used to design any protein
with novel function or improved function over the target protein which
serves as a lead in the process. In particular, mutant antibodies can be
designed to include diverse sequences in the CDR regions, and to
replace non-human sequences in the frameworks of the variable regions
with human ones to reduce immunogenicity of the designed antibody
when used as a human therapeutic.

The library constructed by using the inventive methodology
provides a structurally diverse and yet functionally more relevant
source of antibody candidates for further screening for novel antibodies
with high affinity against a wide range of antigens and having no or
minimal immunogenicity to human subjects treated with antibody
therapeutics.

1. Principles of in silico selection and construction of a master 
library of functionally representative human antibody 

Antibodies are a unique class of proteins which play profound roles
in a vertebrate's ability to defend itself against infection by neutralizing
(or inactivating) viruses and bacterial toxins, and by recruiting the
complement system and various types of white blood cells to kill
extracellular microorganisms and larger parasites.

As with every protein of biological significance, the biological
functions of an antibody depend directly on its three-dimensional (3D)
structure. The 3D structure or conformation determines
the activity of an enzyme, how a receptor interacts with its ligand, and
the affinity of the binding between the receptor and ligand. Thus, it
would be biologically more relevant to screen a library of proteins such
as antibodies based on the 3D structure a particular protein sequence
adopts rather than the primary DNA or amino acid sequence of the
protein.

In particular, as two of the most important handles to map the
functional space of proteins, the sequence and structure information of
antibodies has been accumulated for more than a few decades.
Extensive analysis of their patterns has provided some of the most
detailed understanding of the fundamental process of molecular
recognition, which has a direct impact on combinatorial technology
in chemistry and biology.

So far, major efforts in mapping the functional diversity of
antibodies have been focused on capturing the complexity in antibody
sequence space by either simply increasing the size of the antibody
sequence pool (the so-called one-pot approach) or by generating large
synthetic libraries in CDR regions. Only recently has systematic
analysis of the antibody sequence repertoire been utilized to design a
highly diverse consensus sequence library based on highly used human
germline sequences as observed in the rearranged human antibody
sequences. Knappik et al., supra. These consensus sequences were
further analyzed to account for the canonical structures for the CDR
regions.

In the present invention a distinctly novel approach is utilized to
map the functional repertoire of antibody molecules. This approach is
taken by exploiting the characteristics of antibodies in sequence
diversity and global structural conservation.

It is recognized that although a protein may have an astronomical
number of possible conformations (about 10^16 for a small protein of
100 residues; Dill (1985) Biochem. 24:1501-1509), all antibodies adopt
a characteristic "immunoglobulin fold" globally. The natural antibody
repertoire shows an amazing ability in recognizing a wide variety of
molecules. To confer such diverse functions of binding ability to a
vertebrate's antibodies, an extremely diverse sequence repertoire (about
10^12 possible combinations between the sequences of mouse heavy
chains and light chains) is created by random genomic splicing of heavy
and light chains with high variability in both sequence and length in
their CDRs.

The structural repertoire to accommodate the much larger
sequence repertoire is, however, surprisingly small. Only a limited
number of canonical backbone conformations are found to account for
structures adopted by the CDRs that are docked onto the highly
conserved immunoglobulin scaffold.

1) General approach 

The general approach for constructing a structure-based human
library is illustrated by a flow chart in Figure 1.

As illustrated in Figure 1, antibody structures and models in
various protein structure databases such as the PDB are collected. The
structural repertoire of these antibody molecules is mapped out in
their three-dimensional shape space. It is believed that conservation
in the shape space should make it possible to develop
some general frameworks that remain constant across different species.
On the other hand, variation in the shape space should make it
possible to capture the functional diversity of antibodies against a wide
array of antigens in specific antibody regions.

Referring to Figure 1, the variable regions in shape space are
clustered either separately (such as CDR3) or in combination (CDR1 &
CDR2) into distinct families with or without certain conserved
structural frameworks.

Still referring to Figure 1, these structural clusters, the
ensemble average, and/or their corresponding sequence profiles are
used to map out the corresponding sequences in the human germline (or
in a rearranged antibody sequence database) to find optimal sequences
or sequence profiles within each family.

As diagrammed in three boxes in the middle portion of Figure 1, at
least three approaches can be taken to exploit the information
generated by structure-based clustering of target antibody sequence(s).
As described in the left box, one approach is to directly select
sequences that fit onto the target structural template by using
algorithms such as reverse threading, PSI-BLAST and profile HMM. For
example, a library of recombinant antibodies can be generated by 1)
selecting from a human antibody germline database sequence segments
that fit onto a structural template of a target FR region of an antibody;
2) selecting from a protein database sequence segments that fit onto a
structural template of a target CDR region of the antibody; and 3)
combining the selected FR and CDR segments to build the library of
recombinant antibodies which are then synthesized and screened
against a target antigen in vitro or in vivo.

As described in the right box in the middle section of Figure 1,
another approach is to indirectly select antibody sequences using a
target sequence or sequence profile built based on a structural template
of a target antibody. For example, a library of recombinant antibodies
can be generated by 1) aligning the target sequence or sequence profile
with tester sequences from a protein database (e.g., human germline
antibody sequence database or PDB) by using BLAST or multiple
sequence alignment methods such as profile HMM; 2) selecting the
tester sequences with homology to the target sequence (e.g., sequence
homology of at least 15%); and optionally 3) evaluating the structural
compatibility of the selected tester sequences with the structure template
of the target sequence or sequence profile. This process can be carried
out to construct a library of recombinant antibodies by targeting a
particular region of an antibody such as a CDR, FR, and combination
thereof. The selected tester sequences may be profiled based on
variability in each amino acid residue and those variants with low
occurrence frequency (e.g., 5 times out of 100 selected tester sequences)
may be filtered and discarded. The rest of the selected tester sequences
may be pooled and combined by a combinatorial combination of the
amino acid variants in each residue position. The tester sequences
selected by targeting the CDR region and the ones targeting the FR
regions may also be combined; and the combined sequences may be
filtered based on their structural compatibility with the target antibody.
The library of recombinant antibodies can be synthesized and screened
against a target antigen in vitro or in vivo.
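
By way of a non-limiting illustration, the profiling, filtering and combinatorial steps described above may be sketched as follows. The sketch assumes that the selected tester sequences are already aligned and of equal length, and that a variant occurring at or below the cutoff frequency (here 5 out of 100) is discarded. That reading of the cutoff, the function names and the sequences are assumptions made for this example only.

```python
from collections import Counter
from itertools import product


def allowed_variants(selected, min_fraction=0.05):
    """For each position, keep residues occurring above min_fraction."""
    n = len(selected)
    allowed = []
    for column in zip(*selected):
        counts = Counter(column)
        allowed.append(sorted(r for r, c in counts.items() if c / n > min_fraction))
    return allowed


def combinatorial_library(allowed):
    """Combine the retained variants at every position."""
    return ["".join(combo) for combo in product(*allowed)]


# 100 hypothetical tester sequences, for illustration only.
selected = ["AKT"] * 60 + ["ARS"] * 35 + ["GKT"] * 5
allowed = allowed_variants(selected)
print(allowed)                          # G at position 1 is filtered out
print(combinatorial_library(allowed))   # includes recombinants such as AKS
```

The resulting pool contains combinations (for example AKS and ART) that do not occur among the selected tester sequences themselves, which is why a subsequent structural compatibility filter may be applied.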

As described in the middle box in the middle section of Figure 1,
yet another approach is to select antibody sequences based on a target
structural template by combining the methods described in the left and
right boxes. For example, a library of recombinant antibodies can be
generated by 1) aligning the sequence or sequence profile of the target
structural template with tester sequences from a protein database (e.g.,
human germline antibody sequence database or PDB) and reverse
threading the tester sequences onto the target structural template by
using a structure/sequence dual selection algorithm such as
COBLATH; and 2) selecting the tester sequences with homology to the
target sequence (e.g., sequence homology of at least 15%) and
structurally compatible with the target structural template. This
process can be carried out to construct a library of recombinant
antibodies by targeting a particular region of an antibody such as a
CDR, FR, and combination thereof. The selected tester sequences may
be profiled based on variability in each amino acid residue and those
variants with low occurrence frequency (e.g., 5 times out of 100 selected
tester sequences) may be filtered and discarded. The rest of the
selected tester sequences may be pooled and combined by a
combinatorial combination of the amino acid variants in each residue
position. The tester sequences selected by targeting the CDR region
and the ones targeting the FR regions may also be combined; and the
combined sequences may be filtered based on their structural
compatibility with the target antibody. The library of recombinant
antibodies can be synthesized and screened against a target antigen in
vitro or in vivo.

There are several advantages associated with this approach of 
mapping the functional space of proteins using diversity libraries that 
are designed by sampling the diversity in shape space rather than in 
sequence space. 

First, protein-protein interactions, such as those between ligand and
receptor or antigen and antibody, take place in well-defined
conformations in space. Therefore, antibody libraries should be designed
to map the 3-dimensional space populated by antibodies in order to
target antigens with different shapes.

Second, compared to the larger sequence repertoire, the structural
repertoire of antibodies is limited to a small number of canonical
structures in the main chain conformations of the CDR regions, which
are docked onto a common core structure for both the variable light and
heavy chains. The simplicity of the structural repertoire makes it easy to
map the functional diversity based on variation in 3-dimensional
space and simple to cluster seemingly complicated sequence pools into
distinct families for library construction.

Third, it is conceived that, given the conserved nature of the structural
repertoire of immunoglobulins across very different species (Barre et al.
(1994) "Structural conservation of hypervariable regions in
immunoglobulins evolution" Nature Struct. Biol. 1:915-920),
clustering the structural repertoire of antibodies from different species
into distinct families is a viable approach to map its functional space.
This approach is simple yet functionally more relevant for selecting and
constructing the diversity libraries once it is applied to the sequence
repertoire for a specific species. This is particularly important for
constructing human antibody libraries for therapeutic application or for
humanizing murine antibodies by using the human-derived sequence
repertoire for their counterparts. In contrast, sequence homology-based
approaches would be less flexible and hard to transfer from species to
species if sequence homology is relatively low.

Moreover, the structure-based construction of sequence libraries
makes it possible to apply various methods developed in structural
biology to filter apparent complexity in sequence space based on
structural or physical principles, in addition to the tools used in
sequence analysis that largely rely on the principles of evolution.

Accordingly, the present invention provides a method of
constructing a master library of functionally representative antibodies.
This master library is formed by a repertoire of antibody sequences
adopting distinct classes of structures that cover, ideally, almost all of
the 3D structural ensembles and/or ensemble averages of all vertebrate
antibodies.

According to the present invention, a master library of
functionally representative antibodies is represented by a library of
antibody sequences adopting distinct classes of structures that cover,
ideally, almost all families of the 3D structures of all vertebrate
antibodies. Although a protein may have an astronomical number of
possible sequence combinations (about 10^16 for a small protein of 100
residues; Dill (1985) Biochem. 24:1501-1509), all antibodies adopt a
characteristic "immunoglobulin fold" globally. The natural antibody
repertoire shows an amazing ability in recognizing a wide variety of
molecules. To confer such diverse functions of binding ability to a
vertebrate's antibodies, an extremely diverse sequence repertoire (about
10^12 possible combinations between the sequences of mouse heavy
chains and light chains) is created by random genomic splicing of heavy
and light chains with high variability in both sequence and length in
their CDRs.

The structural repertoire to accommodate the much larger
sequence repertoire is, however, surprisingly small. Only a limited
number of canonical backbone conformations are found to account for
structures adopted by the CDRs that are docked onto the highly
conserved immunoglobulin scaffold.

According to the present invention, it is believed that an antibody
achieves its functional diversity by decorating a diverse array of amino
acids onto a finite number of CDR canonical structures. The present
invention clusters antibodies with experimental or modeled structures
into distinct families. By clustering the antibodies according to their 3D
structures instead of using conventional methods of classification based
on sequence homology, each family of the structure repertoire should
better represent the population of antibodies with binding geometry
complementary to the recognition sites of potential antigens, although
the binding affinity could be further optimized by matching the shape
and chemical nature of the specific amino acids. Therefore, the
approach taken in the present invention tends to maximize the
functional diversity of antibodies in recognizing and binding to a wide
array of antigens in silico and meanwhile to minimize the sequence
space required for efficient screening in vitro or in vivo.

2) Construction of antibody sequence library based on structural 
constraints 

Once structural families are identified, either the cluster
containing multiple members, a representative member, or an ensemble
average of the cluster if possible, can be used as structural constraints
to either select for optimal sequences or to construct sequences for
further constructing sequence libraries.

There are several ways to use these structural families from
sampling antibody structure databases as the constraints for
constructing desired sequence libraries. The main chain conformations
of 3D structures within a structural family or cluster are called
structure ensembles or structure templates. The ensemble average
refers to the average structure of all members within a structure
cluster or family when it is physically meaningful to take the average
of the main chain conformations. If it is not physically meaningful or
possible to take the average for all members within a structural cluster
or family due to differences in length, etc., a representative structure
may be used to represent the "average structural properties" of all
members within a structural family or cluster. The structure ensembles
or templates, ensemble average, or representative structures described
above are collectively referred to herein as the "structural templates".

The choice among these terms in describing structural
constraints depends on how much structural constraint within a
cluster should be included in constructing sequence libraries. For
structural constraints, the most stringent and reasonable approach
should be to include all ensemble structures or templates within a
cluster or family. The ensemble average, if done properly, may be the
simplest structural constraint and easy to compute. If taking the
ensemble average is not physically meaningful, the representative
structure may be a compromise to replace constraints by the structure
cluster.

Once the structural constraints are identified, there are several
ways to construct sequence libraries by applying structural constraints.
The procedure to select and construct sequences or sequence libraries
that are compatible with the structural constraints is called reverse
engineering, including but not limited to reverse threading. However, an
important aspect of the current invention is to restrict the sequence
database for library construction to specific species and/or even to a
specific population of the same species. For therapeutic purposes, the
human immunoglobulin sequence databases are preferably used to
construct human-derived antibody libraries, especially in the
frameworks of the variable regions. In the CDR regions, sequences with
non-human origins may optionally be used to increase the diversity of
these regions so as to increase the chance of finding antibodies with


WO 02/084277 PCT/US02/12202 

novel or improved function(s). The methods of applying both
physical and evolutionary constraints to construct sequence libraries
are described in detail below.

One method is to use the sequence that is compatible with the
ensemble average structure or the representative member within a
structure cluster to search for the optimal sequences from the germline
sequence database. This will usually yield the sequence with the
highest sequence identity to the query sequence using BLAST as
demonstrated in Section 3 below (Figures 10 and 11).
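
By way of a non-limiting illustration, ranking germline sequences by sequence identity to a query may be sketched as follows. The sketch is not BLAST: it assumes the sequences are already aligned to equal length (with "-" marking gaps) and simply counts identical positions. The germline names and sequences are hypothetical.

```python
def percent_identity(a, b):
    """Percent of aligned, ungapped positions with identical residues."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def best_germline(query, germline):
    """Return (name, identity) of the entry most identical to the query."""
    name = max(germline, key=lambda n: percent_identity(query, germline[n]))
    return name, percent_identity(query, germline[name])


# Hypothetical pre-aligned germline segments, for illustration only.
germline = {"g1": "EVQLVESG", "g2": "QVQLQQSG", "g3": "EVQLLESG"}
print(best_germline("EVQLVQSG", germline))
```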

The clustered structures within a structure family can give
multiple sequence alignments based on 3D structures. These aligned
sequences might come from different species; they may be close or
remote sequence homologues. The multiple sequence alignment can
nevertheless be used to build a profile Hidden Markov Model (HMM), and
this HMM is then used to search for close and/or remote human
homologues in a human sequence database such as the human
germline and/or rearranged sequence database as demonstrated in
Section 3 below (Figures 12A-C and 13).

A more direct way to search for sequences compatible with
structural constraints is to thread amino acid sequences from the human
germline and/or rearranged sequence database onto structure
templates of the structural cluster and to find the optimal scoring
sequences on their target structure templates. These sequences can
then be used for constructing sequence libraries for the structure
cluster. This procedure is called reverse threading because it tries to
find the best sequences fitting the target structure templates, which
is the opposite of threading, which tries to find the best structure
template from a structure library for a given sequence.
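
By way of a non-limiting illustration, reverse threading may be sketched as ranking candidate sequences on fixed templates. The sketch assumes each template of the cluster is reduced to a 1D string of environment classes ("b" buried, "e" exposed), scores only candidates whose length matches a template, and uses a toy ungapped compatibility score. The scoring function, templates and candidate sequences are assumptions made for this example only.

```python
HYDROPHOBIC = set("AVLIMFWC")


def compatibility(sequence, template):
    """Toy ungapped score: +1 per residue suiting its environment, else -1."""
    return sum(
        1 if (res in HYDROPHOBIC) == (env == "b") else -1
        for res, env in zip(sequence, template)
    )


def reverse_thread(candidates, templates, top=2):
    """Rank candidates by best score over same-length cluster templates."""
    scored = []
    for seq in candidates:
        scores = [compatibility(seq, t) for t in templates if len(t) == len(seq)]
        if scores:  # candidates matching no template length are skipped
            scored.append((max(scores), seq))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [seq for _, seq in scored[:top]]


templates = ["bebe", "bebbe"]                  # hypothetical cluster members
candidates = ["LKVS", "KLSV", "LKVIS", "GG"]   # hypothetical database
print(reverse_thread(candidates, templates))
```

Here the structure is held fixed and the sequences vary, whereas in threading the sequence is held fixed and the templates vary.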

Additionally, the top hits of the sequences found for a structure
cluster or queried sequences may be profiled by threading multiple
amino acids at each position in a combinatorial approach to select
the best "consensus sequence" compatible with the structural ensemble
and/or ensemble average. This process of searching for a consensus
sequence differs from the method described in Knappik et al., supra,
which derives the consensus sequence from a simple sequence average at
each position. The consensus sequence according to the present
invention is created using a physically oriented reverse engineering
approach that considers all possible combinations of the amino acids
allowed at each position in the retrieved sequences and optimizes them
by scoring their compatibility with the structural constraints.
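
The combinatorial search described above can be sketched as follows,
purely as an illustration. The residues allowed at each position are
taken from a set of aligned hits, every combination is enumerated, and
the best-scoring combination is kept. The burial-based score is a toy
placeholder for a structure-based scoring function, and all names and
data are hypothetical.

```python
from itertools import product

HYDROPHOBIC = set("AVILMFWC")

def allowed_residues(aligned_hits):
    """Collect the residues observed at each position of the aligned hits."""
    return [sorted(set(column)) for column in zip(*aligned_hits)]

def score(sequence, template_buried):
    # Toy stand-in for a structure-based compatibility score.
    return sum((res in HYDROPHOBIC) == buried
               for res, buried in zip(sequence, template_buried))

def best_combination(aligned_hits, template_buried):
    """Enumerate every combination of allowed residues; keep the best."""
    choices = allowed_residues(aligned_hits)
    candidates = ("".join(combo) for combo in product(*choices))
    return max(candidates, key=lambda seq: score(seq, template_buried))

# Hypothetical top hits and template (True = buried position).
hits = ["VSDK", "KSLK", "VTLE"]
template = [True, False, True, False]
consensus = best_combination(hits, template)
```

The selected sequence need not equal any single hit, which is the
difference from taking the most frequent residue per position.
Exhaustive enumeration grows exponentially with sequence length, so a
practical implementation would likely restrict it to selected positions.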

The human antibody sequences that are selected according to
these criteria for the framework regions can serve as the sequence
templates for building a master framework for constructing the human
antibody library of the present invention. These selected human
sequences are then pooled together and included in the master
framework. The same methods can be used to construct the sequence
libraries for the CDR regions, provided that the structure templates
for each canonical structure family of CDRs are used.

Once the master framework of human antibody is constructed,
mutagenesis can be carried out to diversify specified region(s) in the
master framework. For example, CDR regions, especially CDR3 of the
heavy chain, of the master framework can be randomly mutagenized to
mimic the natural process of antibody diversification. The mutagenized
antibody sequences may be further selected in silico based on their
compatibility with the structural ensemble average. All of the antibody
sequences selected in these processes are pooled to form a master
library of human antibody which can be screened against a wide range
of antigens in vitro or in vivo.
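
As a non-limiting sketch, random mutagenesis of a specified region in
silico might look as follows. The framework string, the region
boundaries, and the function names are hypothetical, and a fixed random
seed is used only so that the example is repeatable.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def randomize_region(sequence, start, end, rng):
    """Replace residues in [start, end) with randomly drawn amino acids."""
    mutated = "".join(rng.choice(AMINO_ACIDS) for _ in range(end - start))
    return sequence[:start] + mutated + sequence[end:]

def build_variants(framework, start, end, n_variants, seed=0):
    """Generate n_variants sequences that differ only inside the region."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [randomize_region(framework, start, end, rng)
            for _ in range(n_variants)]

# Hypothetical sequence; the randomized loop is assumed to span 3..9.
framework = "CARGGYFDYWGQ"
variants = build_variants(framework, 3, 9, n_variants=5)
```

Each variant would then be scored against the structural ensemble
average, and only the compatible variants would be pooled into the
master library.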

Since the selection and construction of the antibody library of the
present invention are based on structural clustering, not simple
sequence homology alignment, it is possible to further limit the
number of antibody sequences in the library without sacrificing the
functionally relevant sequences. For example, multiple human
antibody sequences may be highly diverse in sequence and yet
adopt the same 3D structure when threaded onto the structural
ensemble average.

Further, for those antibody sequences mutagenized randomly in
the CDR3 region, not all structures of randomized CDR3 are compatible
with the framework structural ensemble averages. Consequently, a
smaller number of structurally diverse CDR3 loops will be
selected, and therefore a smaller number of human sequences.
As a result, the sequence space of antibody to be screened is reduced in
size without sacrificing diversity in antibody functionality.

By using the method, a master library of human antibody
sequences is selected and constructed to better represent all antibody
structures in the vertebrate antibody repertoire that are functionally
important for high affinity binding to antigen and for eliciting
antibody-dependent cellular responses. Such a functionally representative
library provides a structurally diverse, and yet functionally more
relevant, source of antibody candidates which can then be screened for
binding to a wide variety of target molecules, including but not limited
to biomacromolecules such as proteins, peptides, and nucleic acids, and
small molecules.

The method of the present invention is an efficient way of
constructing a digital library of antibodies which represents most of the
3D structures of antibodies that are functionally relevant. Thus, the
human antibody sequences selected by a reverse engineering
process such as threading are finite and yet cover most of the
functionally relevant structures of antibody in the human antibody gene
pool.

In contrast, the current methods of constructing antibody
libraries in vitro involve isolation of cDNA libraries from an immunized
human antibody gene pool, a naive B-cell Ig repertoire, or particular
germline sequences. Barbas and Burton (1996), supra; De Haard et al.
(1999), supra; and Griffiths et al. (1994), supra. These libraries are very
large and extremely diverse in terms of antibody sequences.

The conventional approach is to create a library of antibodies as
large and as diverse as possible to mimic the immunological response to
antigen in vivo. Typically, these large libraries of antibodies are
displayed on phage surface and screened for antibodies with high
affinity binding to a target molecule. Such a "fishing in a large pond"
or "finding a needle in a huge haystack" approach is based on the
assumption that a simple increase in the size of the sequence repertoire
should make it more likely to fish out an antibody that can bind to a
target antigen with high affinity.

There may be several problems associated with such a
conventional approach. A simple increase in the size of the sequence
library may not necessarily correlate with an effective increase in
functional diversity. Further, due to the physical limit on making an
extremely large experimental library, it may be very difficult to
construct a library with diversity over 10^11 in vitro in the lab. The
library that is actually screened experimentally probably represents only
a fraction of the sequence repertoire at the theoretically predicted
size. In addition, there is legitimate concern that, with the
difficulties and the under-representation problems associated with
handling and manipulating an extremely large library in vitro, time and
money may be lost in an effort to increase the size of the library
without increasing functional diversity significantly.

Another approach existing in the art is to design an artificial
antibody library computationally and then construct a synthetic
antibody library which is expressed in bacteria. Knappik et al., supra.
The artificial antibody library was designed based on the consensus
sequence of each subgroup of the heavy chain and light chain
sequences according to the germline families. The consensus was
automatically weighted according to the frequency of usage. The most
homologous rearranged sequence for each consensus sequence was
identified by searching against the compilation of rearranged
sequences, and all positions where the consensus differed from this
nearest rearranged sequence were inspected. Furthermore, models for
the seven VH and seven VL consensus sequences were built and
analyzed according to their structural properties. A library of artificial
antibodies was then constructed and expressed in E. coli. The library
so constructed can be used to screen for antibodies with high affinity
binding to a target molecule.

However, there is a major problem with such an approach
as far as therapeutic applications of the selected antibody are
concerned. Although derived from human sequence pools, the
consensus sequences found by using this approach are, by definition,
not natural sequences. (1) Combination of sequences, albeit human
sequences, at various positions may give rise to new immunogenic
epitopes, thus significantly limiting therapeutic applications of the
selected antibodies in humans, whereas the method described here can
give either fully human sequences or human-derived sequences or both.
(2) The consensus sequence has its own serious limitations. The
definition of a consensus sequence may be too arbitrary, and the
artificial sequences so defined may not be representative of a natural,
functional structure, although experimental tests and structure analysis
may eliminate some unfavorable amino acid combinations. (3)
Although the consensus sequences are designed to cover mainly those
human germline sequences that are highly used in rearranged human
sequences, this design might bias the consensus sequence library toward
the limited number of antigens to which humans have been exposed so
far, whereas sampling functional space by mapping structures of
different species covers a wide range of functional epitopes of
antibodies exposed to a wide array of antigens. This would be very
important for designing antibody libraries to target novel antigens.

By contrast, the method of the present invention is based on
structural constraints of antibodies taken directly from, or derived
from, natural sources. According to the present invention, a complete
structural repertoire of all available antibodies, including those of
humans and other vertebrates, can be analyzed for structural ensembles
and/or ensemble averages within each representative 3D structural
family. Based on this analysis, the structural models are clustered
into distinct structure families, each of which includes one or more
representative members. These structure families ideally should
represent evenly the structure space which all antibodies, including
those from humans and other animals, would adopt. Thus, by collecting
and building structural models for each structural ensemble and/or its
ensemble average for these antibodies, a relatively comprehensive survey
of the functional repertoire of antibodies across species may be
achieved.

Further, the method of the present invention involves
selecting the native human antibody sequences which fit best onto
the structural ensemble or ensemble average in each
structural family. By selecting and pooling the native human sequences
based on the 3D structural templates in each family, a more focused human
antibody library is created. The library may be smaller than the native
antibody gene pool and yet representative of the functional repertoire of
antibodies in all vertebrates.

Moreover, the sequences of the antibody library constructed
using the method of the present invention are closely related to human
sequences. An antibody selected from this library against a target
molecule should be more desirable than an artificial, non-human
antibody for therapeutic applications and for humanization of non-human
antibodies. This approach can minimize the potential for creating new
immunogenic epitopes associated with using synthetic antibody
sequences derived from randomization of the consensus sequences.

In addition, the library generated according to the method of the
present invention should encompass a broader spectrum of the basic
functions of an antibody, antigen recognition and neutralization, since
the families of structural ensemble averages are clustered based
not only on the structures adopted by known human sequences, but also on
structures collected from other vertebrates. In particular, monoclonal
antibodies produced by mice are a rich source of structures to be included
in the process of clustering. Since these monoclonal antibodies are
generated by immunization against a vast number of antigens,
including these antibodies in the clustering process should tend to
enlarge the functional repertoire, although a few special features
specific to mice should be taken into account or avoided when applied
to humans. This approach may effectively avoid the problem associated
with known human antibody sequences, which are restricted to those
isolated against a limited number of antigens.

2. Process of clustering antibodies based on their structural
ensemble or ensemble averages

According to the present invention, a master library of human
antibody sequences can be constructed based on 3D structural clusters
of antibodies from humans and other vertebrates. The 3D structural
ensembles and/or the ensemble averages serve as master frameworks
onto which human antibody sequences are mapped, for example by
threading, and those most compatible are selected to form the master
library of human antibody.

The structural ensemble or ensemble averages of antibodies from
various species may be modeled in silico by using various structural
alignment methods for comparing antibodies with known 3D structures.
By "known 3D structures" is meant X-ray crystal structures, NMR
structures, and 3D structures of antibodies modeled in silico. Currently,
there are about 360 antibody 3D structures deposited in the Protein
Data Bank (PDB), which include 306 X-ray structures, 17 NMR
structures, and 32 modeled structures.

For example, antibody structural clusters can be generated by
pairwise structural alignment of the VH or VL of two or more antibodies
with known 3D structures from the PDB. Various algorithms have been
developed for protein structure alignment, including those attempting
global optimization of the alignment path for some similarity measure
using dynamic programming (Orengo et al. (1992) Proteins 14:139-167),
Monte Carlo (Holm and Sander (1993) J. Mol. Biol. 233:123-138), 3D
clustering (Fischer et al. (1992) J. Biomol. Struct. Dyn. 9:769-789; and
Vriend and Sander (1991) Proteins 11:52-58) and graph theory
(Alexandrov (1996) Protein Eng. 9:727-732), and an algorithm using
incremental combinatorial extension (CE) of the optimal path
(Shindyalov and Bourne (1998) Protein Eng. 9:739-747; and Shindyalov
and Bourne (2001) Nucleic Acids Res. 29:228-229).

In an embodiment of the present invention, the antibody
structural families are clustered by structural alignment using the CE
algorithm. Compared to Monte Carlo and 3D clustering algorithms, the
CE algorithm significantly reduces the search space and empirically
establishes a reasonable target function. The CE target function
assumes that the alignment path is continuous when including gaps and
that there is an optimal match between the pair. Various protein
properties can also be used with the CE algorithm, for example: 1)
structure superposition as rigid bodies; 2) inter-residue distance; 3)
environmental properties (e.g., exposure, secondary structure); and 4)
conformational properties (e.g., bond angles, dihedral angles, and
orientation with respect to the protein center of mass).

As a proof of principle, 3D structures of a series of artificial
antibody sequences were compared by using the CE algorithm and
classified into a smaller number of clusters based on their 3D
structural alignments. The artificial sequences tested are the
consensus sequences of the subgroups of the heavy and light chain
sequences according to the germline families. Knappik et al., supra.
These sequences, shown in Figure 2, consist of the following 7 VH and 7
VL consensus sequences:

VH                         VL

1DHA [SEQ ID NO: 1]        1DGX [SEQ ID NO: 8]
1DHO [SEQ ID NO: 2]        1DH4 [SEQ ID NO: 9]
1DHQ [SEQ ID NO: 3]        1DH5 [SEQ ID NO: 10]
1DHU [SEQ ID NO: 4]        1DH6 [SEQ ID NO: 11]
1DHV [SEQ ID NO: 5]        1DH7 [SEQ ID NO: 12]
1DHW [SEQ ID NO: 6]        1DH8 [SEQ ID NO: 13]
1DHZ [SEQ ID NO: 7]        1DH9 [SEQ ID NO: 14]

The seven VH consensus sequences stored in the PDB, 1DHA,
1DHO, 1DHQ, 1DHU, 1DHV, 1DHW, and 1DHZ, correspond to VH1A,
VH1B, VH2, VH3, VH4, VH5, and VH6, respectively, as described in
Knappik et al., supra. The seven VL consensus sequences stored in the
PDB, 1DGX, 1DH4, 1DH5, 1DH6, 1DH7, 1DH8, and 1DH9, correspond to
VLκ1, VLκ2, VLκ3, VLκ4, VLλ1, VLλ2, and VLλ3, respectively, as
described in Knappik et al., supra.

The 3D structural models of these VH and VL consensus
sequences built by Knappik et al. were retrieved from the PDB and
compared by using the CE algorithm. It should be noted that CDR3 of
the heavy and light chains was the same for all frameworks in the
modeled structures. The CE program compares pairs of protein
structures, either whole polypeptide chains or their segments, based on
the root mean square deviation (RMSD), statistical significance
(Z-score), length difference, allowable gaps (given as the percentage
of residues without a matching partner relative to the complete
alignment) and sequence identity.
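
Two of the comparison measures named above can be sketched as follows,
purely for illustration. The sketch assumes the two chains have already
been superimposed and matched residue by residue, which the CE program
does itself; it does not reproduce the CE alignment or its Z-score. The
function names and coordinates are hypothetical.

```python
import math

def ca_rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))

def gap_percentage(aligned_a, aligned_b, gap="-"):
    """Columns where either chain has a gap, as a percentage of all columns."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("need two non-empty aligned strings of equal length")
    unmatched = sum(a == gap or b == gap
                    for a, b in zip(aligned_a, aligned_b))
    return 100.0 * unmatched / len(aligned_a)

# Hypothetical, pre-superimposed coordinates in angstroms.
chain_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
chain_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
deviation = ca_rmsd(chain_a, chain_b)
gaps = gap_percentage("ACDEF", "AC-EY")
```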

Figure 3 shows the structures of the seven VH sequences
superimposed on each other. The structures are aligned by
superimposing the Cα atoms using CE with RMSD < 2 Å and Z-score
> 4. As shown in Figure 3, the seven VH sequences have a range of
conformational variability, especially in the CDR regions. According to
Knappik et al., these seven structures cover all canonical classes of the
CDRs of the VH structures.

However, by using the method of the present invention, a closer
look at the seven structures reveals a striking conformational
similarity between at least three of the seven VH sequences. By using
the CE algorithm, five VH sequences (1DHA, 1DHO, 1DHW, 1DHZ, and
1DHV) of the 7 consensus sequence families can be clustered into one
structural family with RMSD < 1.5 Å and Z-score > 4 and with sequence
identity ranging from 48% to 87% using 1DHA as the standard. Further
clustering of these 5 VH sequences reveals that 3 VH sequences (1DHA,
1DHO, and 1DHW) collapse into one structural family with RMSD < 0.7 Å
and Z-score > 6 using 1DHA as the standard, even though their sequence
identity ranges widely, from 72% to 87% relative to 1DHA.

Figure 4A shows the Cα trace of the superimposed structures of
these 3 VH sequences (1DHA in green, 1DHO in cyan, and 1DHW in
yellow). Figure 4B shows the superimposed structures with a ribbon
representation of the β-sheets of the VH frameworks. As shown in both
Figures 4A and 4B, these three structures have an almost perfect
superposition (RMSD < 0.7 Å) even in the CDR regions. According to the
present invention, these three structures are clustered into one VH
structure family based on the structural clustering criteria of the
present invention. The rest of the 7 VH sequences, 1DHQ, 1DHU,
1DHV, and 1DHZ, have distinctly different structures and are thus
clustered into 4 distinct structural families with only one member
within each family according to the present invention. Thus, by using
the method of the present invention, the 7 consensus germline VH
sequences of human antibody designated by Knappik et al. can be
represented by 5 distinctly different structural families. The preferred
criteria are RMSD < 1 Å for each structural family and Z-score > 6.

Figure 5 shows the structures of the seven VL sequences
retrieved from the PDB and superimposed on each other. The
structures are aligned by using 1DGX as the reference structure with
RMSD < 1.6 Å and Z-score > 6. As shown in Figure 5, the seven VL
sequences have a wide range of conformational variability, especially in
the CDR regions (the structural flexibility at the N- and C-termini is
disregarded here). According to Knappik et al., these seven structures
cover all canonical classes of the CDRs of the VL structures.

However, by using the method of the present invention, the seven
VL sequences can be re-clustered into a smaller number of families. By
using the CE algorithm, four VL sequences (1DGX, 1DH4, 1DH5 and
1DH6) of the 7 consensus sequence families can be clustered into one
structural family with RMSD < 0.6 Å and Z-score > 6 and with sequence
identity ranging from 67% to 80% using 1DGX as the structure
reference. Figure 6 shows the superimposed 1DGX (green), 1DH4
(yellow), 1DH5 (cyan) and 1DH6 (magenta), which have similar
conformations but varying lengths in the CDR regions. These four
sequences also belong to the VL kappa sequence family.

Further clustering of the 4 VL sequences (1DGX, 1DH4, 1DH5
and 1DH6) reveals that 2 VL sequences (1DH4 and 1DH6) collapse
into a structural family with RMSD < 0.6 Å and Z-score > 6, with CDR1
loop lengths closer to each other, using 1DGX as the reference, while
the other two VL sequences (1DGX and 1DH5) can be clustered into
another structural family (data not shown).

Figure 7 shows the three superimposed structures of 1DH7, 1DH8,
and 1DH9 of the lambda variable light chain, which can be clustered
into one structure family with RMSD < 1.5 Å and Z-score > 6 using 1DGX
as the reference according to the present invention.

Thus, by using the method of the present invention, the 7
consensus germline VL sequences of human antibody designated by
Knappik et al. can be represented by 2 to 3 distinctly different
structural families. Combined with the clustering of the 7 consensus
germline VH sequences into 5 structural families, the human antibody
germline can be represented by a total of 5 x (2 to 3) = 10 to 15
distinct structural families, a much smaller structural repertoire than
the germline sequence repertoire of Knappik et al.: 7 x 7 = 49.
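
The counting behind this comparison can be checked with a few lines of
arithmetic. The variable names are introduced only for this
illustration; the numbers are those stated above.

```python
# Number of VH structural families and the stated range for VL families.
vh_families = 5
vl_families_range = (2, 3)

# Every VH family can pair with every VL family.
structural_pairings = [vh_families * n for n in vl_families_range]

# Pairing every VH consensus sequence with every VL consensus sequence.
sequence_pairings = 7 * 7
```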

The structures of the consensus germline VH and VL sequences
can also be clustered based on the conformation ensemble adopted by a
specific region of the VH or VL, such as a particular CDR region.

Figure 8A shows that the CDR1 regions of the three lambda (λ) VL
sequences (1DH7, 1DH8 and 1DH9) adopt similar conformations with
RMSD < 1 Å. Thus, structures of these three lambda VL sequences are
clustered into one structural family according to the present invention.

Figure 8B shows that the CDR1 regions of the 4 kappa (κ) VL
sequences (1DH4, 1DH6, 1DGX and 1DH5) adopt similar conformations
with RMSD < 0.6 Å and gaps of 1-6 amino acids. Thus, structures of
these four kappa VL sequences are clustered into one structural family
according to the present invention.

Figure 8C shows that the CDR1 regions of the two kappa (κ) VL
sequences (1DH4 and 1DH6) adopt similar conformations with RMSD <
0.6 Å and a 1 amino acid gap in CDR1. Thus, structures of these two
kappa VL sequences are further clustered into one structural family
according to the present invention.

Figure 8D shows that the CDR1 regions of the two kappa (κ) VL
sequences (1DGX and 1DH5) adopt similar conformations with RMSD <
0.6 Å and a 1 amino acid gap in CDR1. Thus, structures of these two
kappa VL sequences are further clustered into one structural family
according to the present invention.

As a result of such clustering with a focus on a specific region of
the VH or VL, the antibody structure families might
be clustered differently. Figure 9 shows that clustering of the
structures adopted by the seven consensus germline VL sequences
based on the structural families in the CDR1 region led to two to three
distinct families of antibody structures: (1DH7, 1DH8 and 1DH9),
(1DH4 and 1DH6), and/or (1DGX and 1DH5). As shown in Figure 9,
within each family, the members adopt similar conformations in their
CDR1 regions with varying lengths in amino acids.

Thus, by further clustering of antibody structures based on a
more focused region of the global structure, i.e., CDR1, the seven
consensus germline VL sequences of human antibody designated by
Knappik et al. can also be represented by 2 to 3 distinctly different
structural families. Combined with the clustering of the 7 consensus
germline VH sequences into 5 structural families, the human antibody
framework sequences can be represented by a total of
5 x (2 to 3) = 10 to 15 distinct structural families, a much smaller
structural repertoire than the consensus framework sequence repertoire
of Knappik et al.: 7 x 7 = 49.

As illustrated by the above example, the method of the present
invention enables one to reduce the size of the antibody sequence
library by clustering the sequences according to their 3D structural
families. Since the structure of a protein or antibody determines its
function in the biological system, the structural ensemble or ensemble
average in each structure family of the present invention should
represent the population of diverse antibody sequences sharing similar
functions, e.g., in antigen recognition and affinity binding.

The above-described method of clustering structures of
consensus human germline antibody sequences serves only as an
example to illustrate the principle of the invention. It should be noted
that such a clustering method is not limited to these structures. In a
broader application, structures of both human and non-human
vertebrate antibodies can be combined in a pool and clustered based
on their structural ensemble, ensemble averages, or representative
structure. This approach presumably reduces the risk of a biased
library consisting only of structures of human antibodies generated by
limited exposure to various antigens. By combining and clustering
structures from both human and non-human vertebrate antibodies,
the structural ensemble or ensemble average so determined should better
represent the functional epitope of the antibody family. In addition,
compared to the approach based on consensus antibody sequences, the
structural ensemble or ensemble average generated by using the
methods of the present invention is based on well-established
structural principles instead of ill-defined consensus sequences.

The following lists the principles followed in clustering
structures:

a). Align structures based on the RMSD of the Cα atoms in the
backbone, the Z-score, and gaps in the length of amino acids.

b). Cluster structures into the same family progressively, based on
smaller RMSD values and smaller gaps in amino acids.

c). Cluster structures using global structure or important motifs.
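
The progressive clustering of principle b) can be sketched as follows,
as an illustration only. The sketch groups structures by single linkage
on a table of pairwise RMSD values and is run twice with a tighter
cutoff. The structure labels and RMSD values are hypothetical, not
measured data, and the Z-score and gap criteria are omitted for
brevity.

```python
def cluster_by_cutoff(names, rmsd_table, cutoff):
    """Single-linkage grouping: link two structures if RMSD < cutoff."""
    clusters = []
    for name in names:
        linked = [c for c in clusters
                  if any(rmsd_table[frozenset((name, other))] < cutoff
                         for other in c)]
        merged = {name}
        for cluster in linked:
            merged |= cluster
            clusters.remove(cluster)
        clusters.append(merged)
    return clusters

# Hypothetical pairwise RMSD values in angstroms (not measured data).
rmsd_table = {
    frozenset(("A", "B")): 0.5,
    frozenset(("A", "C")): 1.2,
    frozenset(("B", "C")): 1.3,
    frozenset(("A", "D")): 3.0,
    frozenset(("B", "D")): 3.1,
    frozenset(("C", "D")): 2.9,
}
names = ["A", "B", "C", "D"]
loose = cluster_by_cutoff(names, rmsd_table, cutoff=1.5)
tight = cluster_by_cutoff(names, rmsd_table, cutoff=0.7)
```

With the looser cutoff, three of the four hypothetical structures fall
into one family; the tighter cutoff splits that family further, which
mirrors the stepwise refinement described in the example above.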

It is believed that, because the structural repertoire is a better way
to represent the functional repertoire, starting from structure should
provide an important and more rational basis for library construction.
The antibody-antigen interaction occurs in 3D structure space rather
than 1D sequence space. Structural changes in the CDRs should be
better represented in 3D space. Using the structure as the criterion,
without going into the details of the exact interaction between antigen
(Ag) and antibody (Ab), should make it possible to score for human
sequences that are more compatible with the representative structure
motif or ensemble.

3. Selection of Antibody Sequences that Fit onto the Targeted 
Structural Ensemble or Ensemble Average 

Once the structures of the antibodies are clustered using the
methods described above, the structure ensemble within a
cluster, its ensemble average, or its representative member can serve
as the target structural scaffold in the search for those human antibody
sequences that adopt the same or a similar 3D structure. For example,
an ensemble average of the structures of a target antibody can be used
as a structural template to search a protein database for
antibody sequences that are diverse in sequence and yet retain the same
functionally relevant structure.

In a preferred embodiment, the human antibody sequences are
selected from the human immunoglobulin germline sequences. The
germline sequences have been clustered into different sequence families
including the V-genes, D-genes and J-genes. The rearranged
immunoglobulin sequences are collected in the Kabat-Wu sequence
databases (Johnson & Wu, Kabat Database and its applications: future
directions (2001) Nucleic Acids Res. 29:205-206). These human
immunoglobulin sequences are retrieved from the Kabat-Wu sequence
databases and stored in the human immunoglobulin (or antibody) sequence
data of the present invention (Figure 1).

According to the present invention, a variety of methods can be
used to search for those human antibody sequences that adopt the
same or a similar structure as the target structural scaffold. The
following are examples of the methods that may be used for achieving
this purpose.

1) Reverse threading 

The conventional threading of a protein sequence is used to predict
the 3D structure scaffold of a protein. Typically, it is a process of
assigning the fold of the protein by threading its sequence onto a
library of potential structural templates by using a scoring function that
incorporates the sequence as well as local parameters such as
secondary structure and solvent exposure. Bowie et al. (1991) Science
253:164-170; Rost et al. (1997) J. Mol. Biol. 270:471-480; Xu and Xu
(2000) Proteins: Structure, Function, and Genetics 40:343-354; and
Panchenko et al. (2000) J. Mol. Biol. 296:1319-1331. For example, in
Rost et al., supra, the threading process starts from prediction of the
secondary structure of the amino acid sequence and the solvent
accessibility of each residue of the query sequence. The resulting
one-dimensional (1D) profile of the predicted structure is threaded
into each member of a library of known 3D structures. The optimal
threading for each sequence-structure pair is obtained using dynamic
programming. The overall best sequence-structure pair constitutes the
predicted 3D structure for the query sequence.

In contrast, the reverse threading of the present invention is a
process of finding the optimal sequence within a library of sequences to
fit onto a target structure. Various scoring functions may be used to
select the optimal sequence(s) from a library comprising antibody
sequences of various lengths. In a preferred embodiment, the scoring
function is capable of discriminating the following among
different sequences with different lengths: (a) the interactions between
the side chains and the backbone template as well as between side chains;
and (b) the gap penalties for sequences with varying lengths in the CDR1,
CDR2 and CDR3 regions.
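
One possible shape for such a two-term scoring function is sketched
below for illustration. The interaction term is assumed to be computed
elsewhere and is passed in as a number; the linear length penalty, its
weight, and the function names are assumptions made for this example
and are not taken from the present disclosure.

```python
CDR_NAMES = ("CDR1", "CDR2", "CDR3")

def loop_length_penalty(seq_lengths, template_lengths, per_residue=1.0):
    """Penalty proportional to the CDR length mismatch with the template."""
    return per_residue * sum(abs(seq_lengths[cdr] - template_lengths[cdr])
                             for cdr in CDR_NAMES)

def total_score(interaction_score, seq_lengths, template_lengths,
                per_residue=1.0):
    """Interaction term (higher is better) minus the length penalty."""
    penalty = loop_length_penalty(seq_lengths, template_lengths, per_residue)
    return interaction_score - penalty

# Hypothetical CDR lengths for a template and a candidate sequence.
template_lengths = {"CDR1": 11, "CDR2": 7, "CDR3": 9}
candidate_lengths = {"CDR1": 12, "CDR2": 7, "CDR3": 7}
penalty = loop_length_penalty(candidate_lengths, template_lengths)
score = total_score(20.0, candidate_lengths, template_lengths)
```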

For example, amino acid sequences from a human germline
immunoglobulin database can be threaded onto the 3D structure of the
target structural template (or scaffold) to search for the sequences
with optimal or acceptable scores.

2) Matching the target structure with the optimal sequence
composition of multiple aligned sequence family

For this method, the optimal sequence that will fit onto the target
structure is selected by matching the target structure with the optimal
sequence composition of a multiple aligned sequence family. The top
hitting sequences found in the human antibody sequence database can
be optimized at each position over all possible compositions to yield the
best sequence composition that fits a target structure, based on the
scoring of the interactions between side chains and backbone and
between side chains.

3) Selecting the optimal sequence by homology alignment with the
sequence of the target structure

Another method of selecting human antibody sequences that will
fit onto the structural scaffold of each member of the structural family
is through homologous alignment with the amino acid sequence of the
representative structure within a family. Such a method of structure-
based sequence alignment can be practiced by the following procedure.

The target structure may be a member of the structural family 
clustered by using the method described in Section 1. This target 
structure serves as a structural scaffold with which a library of human 
antibody sequences is matched. The matching process is performed 
through homologous alignment of the library of human antibody 
sequences with the amino acid sequence of the target structure (the 
sequence template). This method is a process of indirect 
structure-based sequence query, instead of directly searching for sequences that 
can be threaded onto the structural scaffold in the reverse threading 
process described in Section 1) above. Through homologous alignment 
with the sequence template of the target structure, optimal human 
antibody sequences can be efficiently selected using a simple sequence 
alignment method such as BLAST. 
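The identity-based ranking behind this indirect query can be illustrated with a minimal Python sketch. It is a simplified stand-in for BLAST, not BLAST itself: it assumes the library sequences have already been aligned to the sequence template and have equal length. The function names and the toy sequences are hypothetical and serve only as illustration.

```python
from typing import Dict, List, Tuple


def percent_identity(query: str, candidate: str) -> float:
    """Percent of aligned positions at which two pre-aligned sequences agree.

    Assumption: both strings are already aligned and of equal length.
    Gap characters ('-') never count as matches.
    """
    if len(query) != len(candidate):
        raise ValueError("sequences must be pre-aligned to equal length")
    matches = sum(1 for q, c in zip(query, candidate) if q == c and q != "-")
    return 100.0 * matches / len(query)


def rank_by_identity(query: str, library: Dict[str, str]) -> List[Tuple[str, float]]:
    """Rank library sequences by decreasing identity with the query."""
    scored = [(name, percent_identity(query, seq)) for name, seq in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy data, not taken from the figures of this document.
    template = "EVQLVESGGG"
    library = {
        "germline_a": "EVQLVESGGG",
        "germline_b": "QVQLVQSGAE",
        "germline_c": "EVQLLESGGG",
    }
    for name, identity in rank_by_identity(template, library):
        print(f"{name}\t{identity:.1f}")
```

In practice a local alignment tool would handle gaps and unequal lengths; the sketch only shows how a sequence template can serve as the query against which a sequence library is ranked.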
The following is an example of selecting optimal human antibody 
sequence(s) by using the indirect structure-based sequence alignment 
according to the present invention. 

This example demonstrates that fully human antibody sequences 
with extremely high sequence homology (up to 100% sequence identity) can 
be found by matching the library of human antibody sequences against 
the sequence template of the target structure, i.e., the query sequence. 
It can be reasonably assumed that the antibody sequence having the 
highest sequence identity with the query sequence should adopt the 
same or a very similar structure as that of the query sequence. Such 
sequence(s) are included in the library of selected human antibodies to 
represent the same scaffold as the target structure. For each member 
of the structural family, human antibody sequences can be selected to 
match the sequence of the structural ensemble or ensemble average 
(there is only one member within each family). The selected human 
antibody sequences are combined to form a library relatively small in 
sequence space and yet functionally diverse. 

In this example, the library of human antibody germline 
sequences (HuCAL sequences) serves as the library of human antibody 
sequences (Figure 1). The HuCAL sequences in FASTA format as shown 
in Figure 2 were divided into variable light chains and variable heavy 
chains. These sequences were then compared with human 
germline sequences using BLAST (Basic Local Alignment Search Tool). 
The amino acid sequences of the consensus human germline 
sequences that are clustered by using the method described in Section 
1 serve as the query sequences. Each of the query sequences and the 
human germline sequences were aligned and ranked in decreasing 
identity. Figure 10 shows the PDB IDs of the query sequence, name of 
the retrieved germline gene segment, sequence ID number, residues aligned, 
high score, P(N) sum, smallest probability, % identity with the query 
sequence, and the germline family to which the identified germline sequence 
belongs (vhaagrp-f1.aa stands for the f1 subfamily of VH chain; 
vkallaa-f1 stands for the f1 subfamily of VL kappa chain; vlallaa-f1.aa 
stands for the f1 subfamily of VL lambda chain). 

Figure 11 shows the homology alignment for each of the selected 
human antibody germline sequences with the query sequence. 

As shown in Figures 10 and 11, human antibody germline 
sequences with up to 100% homology with the query sequence can be 
found from the library of human antibody germline sequences. For 
example, 1DHA, 1DHW and 1DHV have sequences identical to the 
germline sequence segment, while close germline homologues can be 
found for other sequences corresponding to the target structural 
models. These are trivial cases because there is only one query 
sequence for each structural template. 

4) Selecting the optimal remote homologous sequence(s) of 
structure-based multiple sequence alignment by using profile Hidden 
Markov Model 

Given one clustered structure family, how can one search for the optimal 
sequence(s) that match the aligned multiple sequence profile 
corresponding to the structure alignment? The flow chart in Figure 
12A illustrates an indirect approach to search for remote homologues 
consistent with the multiple sequence alignment from clustered structures. 
The clustered structures within a structure family can give a multiple 
sequence alignment based on their 3D structures. These aligned 
sequences might come from different species; they may be close or 
remote sequence homologues. The multiple sequence alignment can be 
used, however, to build a profile Hidden Markov Model (HMM), and this 
HMM can then be used to select the close and/or remote human 
homologues from a human sequence database such as the human 
germline and/or rearranged sequence database. 
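As a rough illustration of how a structure-based alignment becomes a searchable profile, the following Python sketch builds a log-odds position-specific scoring matrix from aligned sequences and ranks database sequences against it. This is a deliberate simplification: it corresponds only to the match-state emissions of a profile HMM, has no insert or delete states, and is not the HMMER implementation. The uniform background distribution, the pseudocount, and all names are assumptions made for the example.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Return one {residue: log-odds score} table per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for col in range(length):
        # Gap or non-standard characters are ignored when counting.
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_sequence(profile: List[Dict[str, float]], seq: str) -> float:
    """Sum of per-column log-odds scores; unknown residues score 0."""
    if len(seq) != len(profile):
        raise ValueError("sequence length must match profile length")
    return sum(column.get(aa, 0.0) for column, aa in zip(profile, seq))


def rank_hits(profile: List[Dict[str, float]], database: Dict[str, str]) -> List[Tuple[str, float]]:
    """Score every database sequence of matching length, best first."""
    scored = [
        (name, score_sequence(profile, seq))
        for name, seq in database.items()
        if len(seq) == len(profile)
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

A full profile HMM additionally models insertions and deletions, which is what allows hits of different lengths to be retrieved; that capability is outside the scope of this sketch.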

Figure 12B shows the result generated by using the method 
diagrammed in Figure 12A based on a sequence profile of a structure 
cluster of the FR regions of 3 VH sequences. The structure cluster of 
the framework regions of 3 VH sequences, 1dha, 1dho and 1dhw, is 
shown in Figure 4A. Sequences of the FR regions of these 3 VH in the 
structure cluster were obtained by removing CDR1-3 from the VH, and are 
designated as FR123 (Figure 12B). The FR123 sequences were used to 
build an HMM and search human germline antibody sequences or 
humanized sequences. Fifty-two human germline antibody sequences 
(i.e., hits for FR123) were found. Variants in each position of the amino 
acid residues were profiled. Variants that occur fewer than 5 times in the 
position were filtered out (i.e., cutoff value = 5) and discarded. The rest of 
the variants were combined combinatorially to produce a library of 
recombinant FR sequences. The hits for FR123 and/or the 
recombinant FR sequences can be scored based on their structural 
compatibility with the structure cluster of the framework regions. 
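The profiling, filtering and recombination steps described above can be sketched in Python as follows. The sketch assumes the hits are already aligned to equal length; the function names and the toy hit list are hypothetical, while the cutoff of 5 follows the value stated in the text.

```python
from collections import Counter
from itertools import product
from typing import List


def variant_profile(hits: List[str]) -> List[Counter]:
    """Count how often each residue occurs at each aligned position."""
    length = len(hits[0])
    if any(len(hit) != length for hit in hits):
        raise ValueError("hits must be aligned to equal length")
    return [Counter(hit[i] for hit in hits) for i in range(length)]


def filter_variants(profile: List[Counter], cutoff: int = 5) -> List[List[str]]:
    """Keep residues seen at least `cutoff` times; discard rarer variants."""
    kept = []
    for position, counts in enumerate(profile):
        residues = sorted(aa for aa, n in counts.items() if n >= cutoff)
        if not residues:
            raise ValueError(f"no variant passes the cutoff at position {position}")
        kept.append(residues)
    return kept


def combinatorial_library(kept: List[List[str]]) -> List[str]:
    """Combine the surviving variants at every position combinatorially."""
    return ["".join(combo) for combo in product(*kept)]


if __name__ == "__main__":
    # Toy hit list (12 aligned three-residue hits), for illustration only.
    hits = ["AKT"] * 6 + ["ART"] * 5 + ["GRS"]
    kept = filter_variants(variant_profile(hits), cutoff=5)
    print(combinatorial_library(kept))
```

Because the library size is the product of the number of surviving variants at each position, full enumeration is only practical when few positions vary; for larger profiles the size can be computed first and the library sampled instead.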


Figure 12C shows the result generated by using the method 
diagrammed in Figure 12A based on a sequence profile of a structure 
cluster of the CDR1 of 4 kappa VL sequences. The structure cluster of 
the 4 kappa VL sequences, 1dgx, 1dh5, 1dh4, and 1dh6, is shown in 
Figure 8B. Sequences of CDR1 of these 4 kappa VL sequences in the 
structure cluster were used to build an HMM and search the Kabat database. 
The regions in these 4 kappa VL sequences showing a greater variability, 
i.e., the supervariable regions, are highlighted in red (Figure 12C). 
Numerous hits were found with diverse sequences and variable lengths. 
The hits were grouped according to their lengths. The hits in the group having the 
same length as one of the 4 kappa VL CDR1 sequences were compared 
and profiled based on variability in each amino acid residue. Such a 
variant profile was built for each of the 4 kappa VL CDR1 sequences, 
1dgx, 1dh5, 1dh4, and 1dh6. To demonstrate that hits with lengths 
different from these 4 target sequences can also be selected by using the 
inventive method, three artificial sequences, 1dh5a, 1dh5b, and 1dh5c, 
were constructed by inserting more residues into the supervariable 
region of 1dh5 and used as references to group these hits. As shown in 
the right portion of Figure 12C, hits with lengths different from the 4 
"real" target sequences, 1dgx, 1dh5, 1dh4, and 1dh6, were also 
found, variant profiles of which are shown underneath each of the 3 
artificial sequences. 

The variant profiles shown in Figure 12C reveal that there is a 
much higher variability in the supervariable region than in the rest of the 
CDR1. The amino acid residues in the supervariable region may make 
a greater contribution to the specific and high affinity binding of the 
antibody to its antigen. This region can be specifically targeted to 
generate a more focused library of recombinant antibodies for 
structural and functional screening in silico, in vitro or in vivo. 

As also shown in Figure 12C, the CDR1 variants with less than 
5% occurrence frequency were filtered out and discarded. The rest of the 
variants were combined combinatorially to produce a library of 
recombinant CDR1 sequences. The hits for CDR1 and/or the 
recombinant CDR1 sequences can be scored based on their structural 
compatibility with the structure cluster of the 4 kappa VL CDR1 
sequences. 

For the reclustered structure family in the HuCAL model of 
Knappik et al., three of the VH structures are re-clustered into one 
family based on the structure criteria (superimposition and gaps). These 
three sequences should be used as the profiled sequences to build their 
HMM and then search for the corresponding human germline sequence 
that is closest to all of them. Figures 13A-C show the results of the 
search using this method. The identified human germline sequences 
(labeled as "Top Hits") can then be used to represent the corresponding 
structure in our diversity library for the target structures. 

As shown in Figure 13A, the seven VHs of the HuCAL models are 
clustered into 5 structure families: (1DHA, 1DHO, 1DHW), 1DHQ, 
1DHU, 1DHV, and 1DHZ. The seven VLs of the HuCAL models are 
clustered into 3 structure families: (1DGX, 1DH5) and (1DH4, 1DH6) for 
kappa VL, and (1DH7, 1DH8, 1DH9) for lambda VL. Figure 13B shows 
the alignment of the amino acid sequences based on the structures of 
the members within each structure family. 

Figure 13C lists the top hits of human germline antibodies 
identified by using the profile HMM method (HMMER 2.1.1). The HMM 
has been calibrated, and E-values are empirical estimates. The top hits 
to the query sequence profile show some important features that 
need to be captured in order to make a comprehensive library 
for the clustered structure family of 1DHA, 1DHW and 1DHO. It is 
noted that 1DHW belongs to a different family of VH (f5, see Figure 10), 
whereas 1DHA and 1DHO belong to the same family of VH (f1 in Figure 
10), based on the sequence homology classification. It is also apparent 
from comparison between the hits and the query sequence profile that in 
some regions the sequences are highly conserved, whereas in other 
regions sequence variability is large. The conserved regions should be 
a good basis for making a master framework, whereas the highly variable 
regions provide the positions for making a sequence library. 

It should be noted that the order of the top hits depends 
sensitively on the multiple sequence alignment derived from the 
structure-based alignment. This demonstrates that the structure information is 
important for selecting the hit sequences. As shown in Figure 13C, 
some of the top hits differ nontrivially from those obtained by BLAST. 

5) Matching the library of structural templates with the library of 
sequence pools 

A powerful approach to comparing the target structural template 
library with the sequence database is to match them in both directions: 
using threading to find the optimal template for each sequence in 
the sequence database, and then using reverse threading to match each 
template structure to sequences in the sequence database. The 
convergence of both directions should give reliable sequences with which to 
construct the sequence libraries for the desired target structures. This 
method can also be used in combination with other sequence searching 
methods such as COBLATH, which combines PSI-BLAST with threading. 
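The idea of keeping only the matches on which both directions converge can be sketched in Python as a reciprocal-best-match filter. The score tables, the convention that lower scores are better, and the `top_n` tolerance are assumptions made for this illustration; the threading and reverse threading scores themselves would come from external programs.

```python
from typing import Dict, List, Tuple

# threading_scores[sequence][template]: score of threading a sequence onto a template
# reverse_scores[template][sequence]: score of reverse threading a template against a sequence
ScoreTable = Dict[str, Dict[str, float]]


def convergent_pairs(
    threading_scores: ScoreTable,
    reverse_scores: ScoreTable,
    top_n: int = 1,
) -> List[Tuple[str, str]]:
    """Return (sequence, template) pairs supported by both directions.

    Assumption: lower scores are better (energy-like). A pair is kept when
    the template is the sequence's best template AND the sequence is among
    the template's `top_n` best sequences.
    """
    pairs = []
    for seq, by_template in threading_scores.items():
        best_template = min(by_template, key=by_template.get)
        by_sequence = reverse_scores[best_template]
        ranked = sorted(by_sequence, key=by_sequence.get)
        if seq in ranked[:top_n]:
            pairs.append((seq, best_template))
    return sorted(pairs)


if __name__ == "__main__":
    # Toy scores, for illustration only.
    threading_scores = {
        "s1": {"T1": -10.0, "T2": -3.0},
        "s2": {"T1": -8.0, "T2": -9.0},
        "s3": {"T1": -7.0, "T2": -2.0},
    }
    reverse_scores = {
        "T1": {"s1": -12.0, "s2": -4.0, "s3": -6.0},
        "T2": {"s1": -1.0, "s2": -11.0, "s3": -5.0},
    }
    print(convergent_pairs(threading_scores, reverse_scores))
```

Raising `top_n` relaxes the convergence criterion, so that each template can retain several sequences for library construction rather than a single best one.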






4. Examples of Structural Computational Engines 



Many programs are available for modeling structures or 
structural ensembles of antibodies. For example, molecular 
mechanics software may be employed for these purposes, examples of 
which include, but are not limited to, CONGEN, SCWRL, UHBD, and 
GENPOL. 

CONGEN (CONformation GENerator) is a program for performing 
conformational searches on segments of proteins (R. E. Bruccoleri 
(1993) Molecular Simulations 10, 151-174; R. E. Bruccoleri, E. 
Haber, J. Novotny (1988) Nature 335, 564-568; R. Bruccoleri, M. 
Karplus (1987) Biopolymers 26, 137-168). It is most suited to problems 
where one needs to construct underdetermined loops or segments in a 
known structure, i.e., homology modeling. The program is a 
modification of CHARMM version 16, and has most of the capabilities of 
that version of CHARMM (Brooks BR, Bruccoleri RE, Olafson BD, States 
DJ, Swaminathan S, Karplus M. (1983) J. Comput. Chem. 4, 187-217). 
The terms of the total energy include bonds, angles, 
torsional angles, improper terms, vdw and electrostatic interactions with a 
distance-dependent dielectric constant, using the Amber94 forcefield in 
CONGEN. It provides a simple yet fast way to scan a sequence library for 
compatibility with the template structure, with decent correlation 
with the more refined scoring energy functions. 

The CONGEN program is a modeling strategy based on the 
theory that the lowest energy conformation should be close to or 
correspond to the naturally occurring one. Bruccoleri and Karplus 
(1987) Biopolymers 26:137-168; and Bruccoleri and Novotny (1992) 
Immunomethods 96-106. Given an accurate Gibbs function and a 
short loop sequence, all of the stereochemically acceptable structures of 
the loop can be generated and their energies calculated. The one with 
the lowest energy is selected. 

The program can be used to perform both conformational 
searches and structural evaluation using a standard scoring function. 
The program can calculate other properties of the molecules such as 
the solvent accessible surface area and conformational entropies, given 
steric constraints. Each of these properties, in combination with 
other properties described below, can be used to score the digital 
libraries. 

The defined canonical structures are available for five of the 
CDRs (VL CDR1, 2, and 3, and VH CDR1 and 2) but not for VH CDR3. VH 
CDR3 is known to show large variation in its length and conformations, 
although progress has been made in modeling its conformation with an 
increasing number of antibody structures becoming available in the 
PDB (Protein Data Bank) database. CONGEN may be used to generate 
conformations of a loop region (e.g., VH CDR3) if no canonical structure 
is available, and to replace the side chains of the template sequence with the 
corresponding side chain rotamers of the target amino acids. The 
model will then be further optimized by energy minimization, molecular 
dynamics simulation or other protocols to relieve steric clashes etc. in 
the structure model. 

SCWRL is a side chain placing program that can be used to 
generate side chain rotamers and combinations of rotamers using the 
backbone-dependent rotamer library (Dunbrack RL Jr, Karplus M 
(1993) J Mol Biol 230:543-574). SCWRL is a program for adding 
sidechains to a protein backbone based on the backbone-dependent 
rotamer library (Bower MJ, Cohen FE, Dunbrack RL (1997) J Mol Biol 
267, 1268-1282). The library provides lists of chi1-chi2-chi3-chi4 
values and their relative probabilities for residues at given phi-psi 
values. The program can further explore these conformations to 
minimize sidechain-backbone clashes and sidechain-sidechain clashes. 
Once the steric clash is minimized, the side chains and the backbone 
of the substituted segment can be energy minimized to relieve local 
strain using CONGEN (Bruccoleri and Karplus (1987) Biopolymers 
26:137-168). Each structure is scored using a custom energy function 
that measures the relative stability of the sequence in the lead 
structural template. 

Several automatic programs that have been developed specifically for 
building antibody structures may be used for structural modeling of 
antibodies in the present invention. The ABGEN program is an 
automated antibody structure generation algorithm for obtaining 
structural models of antibody fragments. Mandal et al. (1996) Nature 
Biotech. 14:323-328. ABGEN utilizes a homology-based scaffolding 
technique and includes the use of invariant and strictly conserved 
residues, structural motifs of known Fabs, canonical features of 
hypervariable loops, torsional constraints for residue replacements and 
key inter-residue interactions. Specifically, the ABGEN algorithm 
consists of two principal modules, ABalign and ABbuild. ABalign is the 
program that provides the alignment of the antibody V-region sequence 
whose structure is desired with all the V-region sequences of antibodies 
whose structures are known and computes scores for the fitting. The 
highest scoring library sequence is considered to be the best fit to the 
test sequence. ABbuild then uses this best fit model output by ABalign 
to generate the three-dimensional structure and provides Cartesian 
coordinates for the desired antibody sequence. 

WAM (Whitelegg NRJ and Rees AR (2000) Protein Engineering 
13, 819-824) is an improved version of ABM, which uses a combined 
algorithm by Rees et al. (Martin ACR, Cheetham JC, and Rees AR (1989) PNAS 
86, 9268-9272) to model the CDR conformations using the 
canonical conformations of CDR loops from the X-ray PDB database and 
loop conformations generated using CONGEN (see the reference by Rees 
(1995) on antibody engineering). In short, the modular nature of 
antibody structure makes it possible to model its structure using a 
combination of protein homology modeling and structure prediction. 

In a preferred embodiment, the following procedure will be used 
to model antibody structure. Because antibodies are among the most 
conserved proteins in both sequence and structure, homology models of 
antibodies are relatively straightforward, except for certain CDR loops 
that are not yet determined within existing canonical structures or that 
contain insertions or deletions. These loop structures can, however, be modeled 
using a combined algorithm that combines homology modeling with 
conformational search (for example, CONGEN can be used for such a 
purpose). 

The defined canonical structures for five of the CDRs (L1, 2, 3 and 
H1, 2), but not for H3 (i.e., VH CDR3), are used. VH H3 is known to show 
large variation in its length and conformations, although progress has 
been made in modeling its conformation, with an increasing number of 
antibody structures becoming available in the Protein Data Bank (PDB) 
database, using protein structure prediction methods, including 
threading and comparative modeling, which align the sequence of 
unknown structure with at least one known structure based on the 
similarity spanning the modeled sequence. De novo or ab initio 
methods also show increasing promise for predicting the structure from 
sequence alone. The unknown loop conformations can be sampled 
using CONGEN if no canonical structure is available (Bruccoleri RE, 
Haber E, Novotny J (1988) Nature 335, 564-568). Alternatively, ab 
initio methods, including but not limited to the Rosetta ab initio method, can 
be used to predict antibody CDR structures (Bonneau R, Tsai J, 
Ruczinski I, Chivian D, Rohl C, Strauss CE, Baker D (2001) Proteins 
Suppl 5, 119-126) without relying on similarity at the fold level between 
the modeled sequence and any of the known structures. A more 
accurate method that uses state-of-the-art explicit solvent 
molecular dynamics and implicit solvent free energy calculations can be 
used to refine and select native-like structures from models 
generated by either CONGEN or the Rosetta ab initio method (Lee MR, 
Tsai J, Baker D, Kollman PA (2001) J Mol Biol 313, 417-430). The 
interactions between CDRs are first scored using the principles that 
determine the structure of β-sheet barrels in proteins. 

5. Scoring functions for evaluating structural compatibility of 
tester sequence and structural template 

In the implementation of the inventive methods described above, 
thermodynamic computational analysis can be used for evaluating 
structural compatibility of a tester sequence with a target structural 
template. The structural evaluation is based on an empirical and 
parameterized scoring function and is intended to reduce the number of 
subsequent in vitro screenings necessary. The scoring function consists 
of three energy terms: nonpolar solvation, sidechain entropy, and 
electrostatic energy (Sharp KA. (1998) Proteins 33, 39-48; Novotny J, 
Bruccoleri RE, Davis M, Sharp KA (1997) J Mol Biol 268, 401-411). 

For energy functions, there are many that can be used to score 
compatibility of sequences with a template structure or structure 
ensemble. The scoring function is composed of several terms, including 
the contribution from electrostatic and van der Waals interactions, $\Delta G_{MM}$, 
calculated using a molecular mechanics forcefield; the contribution from 
solvation, including electrostatic solvation and solvent-accessible 
surface, $\Delta G_{sol}$; and the contribution from the conformational entropy. 

A simple, fast way to perform computational screening is to calculate the 
structural stability of a sequence using the total or a combination of 
energy terms from a molecular mechanics forcefield such as Amber94 
implemented in CONGEN. 

$$\Delta E_{total} = E_{vdw} + E_{bond} + E_{angle} + E_{electrostatics} + E_{solvation}$$

or alternatively, the binding free energy is calculated as 

$$\Delta G_{b} = \Delta G_{MM} + \Delta G_{sol}(\text{Ag-Ab}) - \Delta G_{sol}(\text{Ag}) - \Delta G_{sol}(\text{Ab}) - T\Delta S$$

where: 

$$\Delta G_{MM} = \Delta G_{ele} + \Delta G_{vdw} \qquad (1)$$

$$\Delta G_{sol} = \Delta G_{ele\text{-}sol} + \Delta G_{ASA} \qquad (2)$$

The electrostatic and van der Waals interaction energies, $\Delta G_{ele}$ and $\Delta G_{vdw}$, 
are calculated using Amber94 parameters implemented in CONGEN for 
$\Delta G_{MM}$, whereas $\Delta G_{ele\text{-}sol}$ is the electrostatic solvation energy required to 
move heterogeneously distributed charges from the gas phase into an 
aqueous phase. This is calculated by solving the Poisson-Boltzmann 
equation for the electrostatic potential for the reference and mutant 
structures. $\Delta G_{ASA}$, the nonpolar energy, is the energetic cost of moving 
nonpolar solute groups into polar solvent, resulting in reorganization of 
the solvent molecules. This has been shown to correlate linearly with 
the solvent accessible surface area of the molecule (Sitkoff D, Sharp 
KA, Honig B (1994) J Phys Chem 98, 1978-1988). 
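The way these terms combine into a binding free energy can be sketched in Python. The sketch only performs the bookkeeping of equations (1) and (2); in practice the individual terms would be supplied by a molecular mechanics program and a Poisson-Boltzmann solver. The linear surface-area model with coefficients `gamma` and `offset`, the choice of kcal/mol as the unit, the class and function names, and the numbers in the example are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SolvationInput:
    ele_sol: float  # electrostatic solvation energy (assumed kcal/mol)
    asa: float      # solvent accessible surface area (assumed square angstroms)


def nonpolar_energy(asa: float, gamma: float, offset: float) -> float:
    """Nonpolar term as a linear function of surface area (assumed form)."""
    return gamma * asa + offset


def solvation_energy(state: SolvationInput, gamma: float, offset: float) -> float:
    """Equation (2): electrostatic solvation plus the nonpolar (ASA) term."""
    return state.ele_sol + nonpolar_energy(state.asa, gamma, offset)


def binding_free_energy(
    dg_ele: float,
    dg_vdw: float,
    complex_state: SolvationInput,
    antigen: SolvationInput,
    antibody: SolvationInput,
    t_delta_s: float,
    gamma: float,
    offset: float,
) -> float:
    """Binding free energy: dG_MM + dG_sol(Ag-Ab) - dG_sol(Ag) - dG_sol(Ab) - T*dS."""
    dg_mm = dg_ele + dg_vdw  # equation (1)
    dg_sol = (
        solvation_energy(complex_state, gamma, offset)
        - solvation_energy(antigen, gamma, offset)
        - solvation_energy(antibody, gamma, offset)
    )
    return dg_mm + dg_sol - t_delta_s


if __name__ == "__main__":
    # Illustrative numbers only; they are not results from this document.
    dg_b = binding_free_energy(
        dg_ele=-20.0,
        dg_vdw=-30.0,
        complex_state=SolvationInput(ele_sol=-100.0, asa=10000.0),
        antigen=SolvationInput(ele_sol=-60.0, asa=6000.0),
        antibody=SolvationInput(ele_sol=-70.0, asa=6000.0),
        t_delta_s=-5.0,
        gamma=0.005,
        offset=1.0,
    )
    print(f"{dg_b:.2f}")
```

Keeping the surface-area coefficients as explicit arguments avoids hard-coding a particular parameterization; they should be taken from the solvation model actually in use.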

The change in sidechain entropy is a measure of the effect on the 
local sidechain conformational space, particularly at the binding 
interface. This is calculated from the ratio of the number of allowed 
sidechain conformations in the reference and mutant structures, in the 
bound and unbound states. For general scoring purposes, the 
independent sidechain approximation is applied to the mutated 
sidechains in order to reduce the computational time resulting from 
sampling the huge conformational space for individual side chains in 
various structural contexts. 
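One plausible reading of this entropy term can be sketched in Python. The text does not give an explicit formula, so the sketch assumes the Boltzmann form S = R ln N for each side chain, sums over side chains under the independent sidechain approximation, and reports the mutant-minus-reference difference as a -TΔS contribution. The function names, the default temperature, and the rotamer counts in the example are hypothetical.

```python
import math
from typing import Sequence

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def binding_entropy_change(n_bound: Sequence[int], n_unbound: Sequence[int]) -> float:
    """dS on binding = R * sum_i ln(N_bound_i / N_unbound_i).

    Each side chain is treated independently, so the log-ratios of allowed
    conformation counts simply add up.
    """
    if len(n_bound) != len(n_unbound):
        raise ValueError("need one bound and one unbound count per side chain")
    return R_KCAL * sum(math.log(b / u) for b, u in zip(n_bound, n_unbound))


def relative_entropy_term(
    ref_bound: Sequence[int],
    ref_unbound: Sequence[int],
    mut_bound: Sequence[int],
    mut_unbound: Sequence[int],
    temperature: float = 298.15,
) -> float:
    """-T * (dS_mutant - dS_reference), in kcal/mol."""
    dds = binding_entropy_change(mut_bound, mut_unbound) - binding_entropy_change(
        ref_bound, ref_unbound
    )
    return -temperature * dds


if __name__ == "__main__":
    # Hypothetical counts: the reference side chain loses rotamers on binding
    # (9 -> 3), whereas the mutant side chain keeps all 9.
    print(relative_entropy_term([3], [9], [9], [9]))
```

Only the mutated side chains need to be passed in, which is what keeps the calculation cheap compared with sampling the joint conformational space.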
6. Energy Functions 

Many energy functions can be used to score the compatibility 
between sequences and structures. Four kinds of energy 
functions can be used: (1) empirical physical chemistry-based 
forcefields based on simple model compounds, such as the standard 
molecular mechanics forcefields discussed below; (2) knowledge-based 
statistical forcefields extracted from protein structures, the so-called 
potential of mean force (PMF) or the threading score derived from 
structure-based sequence profiling; (3) parameterized forcefields obtained by 
fitting the forcefield parameters using experimental model systems; and (4) 
combinations of one or several terms from (1) to (3) with various 
weighting factors for each term. 
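The fourth kind of energy function, a weighted combination of terms, can be sketched in a few lines of Python. The term names and weights below are hypothetical placeholders; in practice the weights would be chosen or fitted for the scoring task at hand.

```python
from typing import Dict


def combined_score(terms: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of energy terms; every weighted term must be present."""
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"missing energy terms: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)


if __name__ == "__main__":
    # Hypothetical term values for one sequence on one template.
    terms = {"mm": -10.0, "pmf": -2.0, "fitted": 4.0}
    weights = {"mm": 1.0, "pmf": 0.5, "fitted": 0.25}
    print(combined_score(terms, weights))
```

Terms that are computed but given no weight are simply ignored, which makes it easy to compare different combinations of the same underlying terms.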

The following well-tested physical-chemistry forcefields can be 
used or incorporated into the scoring functions. For example, the Amber 94 
forcefield was used in CONGEN to score the sequence-structure 
compatibility in the examples below. The forcefields include, but are 
not limited to, the following forcefields, which are widely used by those 
skilled in the art. Amber 94 (Cornell WD, Cieplak P, Bayly CI, Gould 
IR, Merz KM Jr, Ferguson DM, Spellmeyer DC, Fox T, Caldwell JW and 
Kollman PA (1995) JACS 117, 5179-5197). The CHARMM forcefields 
(Brooks, B.R., Bruccoleri, R.E., Olafson, B.D., States, D.J., 
Swaminathan, S., Karplus, M. (1983) J. Comp. Chem. 4, 187-217; 
MacKerell, A D; Bashford, D; Bellott, M; Dunbrack, R L; Evanseck, J D; 
Field, M J; Fischer, S; Gao, J; Guo, H; Ha, S; Joseph-McCarthy, D; 
Kuchnir, L; Kuczera, K; Lau, F T K; Mattos, C; Michnick, S; Ngo, T; Nguyen, 
D T; Prodhom, B; Reiher, W E; Roux, B; Schlenkrich, M; Smith, J C; 
Stote, R; Straub, J; Watanabe, M; Wiorkiewicz-Kuczera, J; Yin, D; 
Karplus, M (1998) J. Phys. Chem. B 102, 3586-3617). The Discover 
cvff forcefields (Dauber-Osguthorpe, P.; Roberts, V. A.; Osguthorpe, D. 
J.; Wolff, J.; Genest, M.; Hagler, A. T. (1988) Proteins: Structure, 
Function and Genetics 4, 31-47). The ECEPP forcefields (Momany, F. 
A., McGuire, R. F., Burgess, A. W., & Scheraga, H. A. (1975) J. Phys. 
Chem. 79, 2361-2381; Nemethy, G., Pottle, M. S., & Scheraga, H. A. 
(1983) J. Phys. Chem. 87, 1883-1887). The GROMOS forcefields 
(Hermans, J., Berendsen, H. J. C., van Gunsteren, W. F., & Postma, J. 
P. M. (1984) Biopolymers 23, 1). The MMFF94 forcefields (Halgren, T. 
A. (1992) J. Am. Chem. Soc. 114, 7827-7843; Halgren, T. A. (1996) J. 
Comp. Chem. 17, 490-519; Halgren, T. A. (1996) J. Comp. Chem. 17, 
520-552; Halgren, T. A. (1996) J. Comp. Chem. 17, 553-586; Halgren, 
T. A., and Nachbar, R. B. (1996) J. Comp. Chem. 17, 587-615; Halgren, 
T. A. (1996) J. Comp. Chem. 17, 616-641). The OPLS forcefields (see 
Jorgensen, W. L., & Tirado-Rives, J. (1988) J. Am. Chem. Soc. 110, 
1657-1666; Damm, W., A. Frontera, J. Tirado-Rives and W. L. 
Jorgensen (1997) J. Comp. Chem. 18, 1955-1970). The Tripos 
forcefield (Clark, M., Cramer III, R. D., Van Opdenbosch, N. (1989) 
Validation of the General Purpose Tripos 5.2 Force Field, J. Comp. 
Chem. 10, 982-1012). The MM3 forcefield (Lii, J-H., & Allinger, N. L. 
(1991) J. Comp. Chem. 12, 186-199). Other generic forcefields such as 
Dreiding (Mayo SL, Olafson BD, Goddard (1990) J Phys Chem 94, 
8897-8909), or specific forcefields used for protein folding or simulations such as 
UNRES (United Residue Forcefield; Liwo et al. (1993) Protein Science 2, 
1697-1714; Liwo et al. (1993) Protein Science 2, 1715-1731; Liwo et al. 
(1997) J. Comp. Chem. 18, 849-873; Liwo et al. (1997) J. Comp. Chem. 
18, 874-884; Liwo et al. (1998) J. Comp. Chem. 19, 259-276). 

The statistical forcefields derived from protein structures can 
also be used to assess the compatibility between sequences and protein 
structure. These potentials include, but are not limited to, residue pair 
potentials (Miyazawa S, Jernigan R (1985) Macromolecules 18, 534-552; 
Jernigan RL, Bahar I (1996) Curr. Opin. Struc. Biol. 6, 195-209). The 
potentials of mean force (Hendlich et al. (1990) J. Mol. Biol. 216, 
167-180) have been used to calculate the conformational ensembles of 
proteins (Sippl M (1990) J Mol Biol. 213, 859-883). However, some 
limitations of these forcefields have also been discussed (Thomas PD, Dill KA 
(1996) J Mol Biol 257, 457-469; Ben-Naim A (1997) J Chem Phys 107, 
3698-3706). Other methods of scoring the compatibility between 
sequences and structure use sequence profiling (Bowie JU, Luthy 
R, Eisenberg DA (1991) Science 253, 164-170) or threading scores 
(Jones DT, Taylor WR, Thornton JM (1992) Nature 358, 86-89; Bryant 
SH, Lawrence CE (1993) Proteins 16, 92-112; Rost B, Schneider R, 
Sander C (1997) J Mol Biol 270, 471-480; Xu Y, Xu D (2000) Proteins 
40, 343-354). These statistical forcefields, based on the quasichemical 
approximation, Boltzmann statistics or Bayes' theorem (Simons KT, 
Kooperberg C, Huang E, Baker D (1997) J Mol Biol 268, 209-225), are 
evaluated to assess the goodness of fit between a sequence and a 
structure or for protein design (Dima RI, Banavar JR, Maritan A (2000) 
Protein Science 9, 812-819). 

Structure-based thermodynamic parameters, or parameters related to the 
formation of the secondary structures of proteins, can also be used to 
evaluate the fitness between a sequence and a structure. In the 
structure-based thermodynamic methods, thermodynamic 
quantities such as heat capacity, enthalpy and entropy can be calculated 
based on the structure of a protein to explain the temperature 
dependence of thermal unfolding, using thermodynamic data 
from model compounds or protein calorimetry studies (Spolar RS, 
Livingstone JR, Record MT (1992) Biochemistry 31, 3947-3955; Spolar 
RS, Record MT (1994) Science 263, 777-784; Murphy KP, Freire E 
(1992) Adv Protein Chem 43, 313-361; Privalov PL, Makhatadze GI 
(1993) J Mol Biol 232, 660-679; Makhatadze GI, Privalov PL (1993) J 
Mol Biol 232, 639-659). The structure-based thermodynamic 
parameters can be used to calculate the structural stability of mutant 
sequences and hydrogen exchange protection factors using an 
ensemble-based statistical thermodynamic approach (Hilser VJ, Dowdy D, Oas 
TG, Freire E (1998) PNAS 95, 9903-9908). Thermodynamic parameters 
relating to statistical thermodynamic models of the formation of 
protein secondary structures have also been determined using 
experimental model systems, with excellent agreement between 
predictions and experimental data (Rohl CA, Baldwin RL (1998) 
Methods Enzymol 295, 1-26; Serrano L (2000) Adv Protein Chem 53, 
49-85). 

A combination of various terms from molecular mechanics 
forcefields plus some specific components has been used in most 
protein design programs. In a preferred embodiment, the forcefield is 
composed of one or several terms, such as the van der Waals, 
hydrogen bonding and electrostatic interactions from standard 
molecular mechanics forcefields such as Amber, CHARMM, OPLS, cvff and 
ECEPP, plus one or several terms that are believed to control the 
stability of proteins. 

7. Examples of forcefields for protein design 

It is understood that, as a general solution to the protein design 
problem, the energy surface describing the interactions among all 
elements of the system is sampled as a function of the atomic 
coordinates over all available sequences and their conformational space. 
Such a procedure may be implemented in the following steps: i) providing a 
target scaffold with the backbone structure, e.g., an X-ray crystal 
structure retrieved from the Protein Data Bank (PDB) or a structural model 
built by modeling; ii) building side chain models of amino acid variants 
onto a selected backbone by using a rotamer library derived from a 
protein structure database; iii) assigning forcefield parameters such as 
charge, radii, etc. to each atom to construct the target function; and iv) 
searching the energy surface of the target function using deterministic 
and/or stochastic algorithms to find the optimal solution or solutions 
ranked by their scores. 
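Step iv), the search over the energy surface, can be illustrated with a toy Python sketch that contrasts a deterministic search (exhaustive enumeration) with a stochastic one (Metropolis Monte Carlo). Steps i) to iii) are replaced here by a hand-written table of per-position and pairwise energies; that table, the residue choices, and all function names are hypothetical and stand in for a real scaffold, rotamer library and forcefield.

```python
import itertools
import math
import random
from typing import Dict, List, Sequence, Tuple

SelfEnergy = List[Dict[str, float]]
PairEnergy = Dict[Tuple[int, str, int, str], float]


def total_energy(seq: Sequence[str], self_energy: SelfEnergy, pair_energy: PairEnergy) -> float:
    """Target function: per-position terms plus pairwise terms (0 if absent)."""
    energy = sum(self_energy[i][aa] for i, aa in enumerate(seq))
    for i in range(len(seq)):
        for j in range(i + 1, len(seq)):
            energy += pair_energy.get((i, seq[i], j, seq[j]), 0.0)
    return energy


def exhaustive_search(choices, self_energy, pair_energy) -> List[Tuple[float, str]]:
    """Deterministic search: score every sequence and rank by energy."""
    scored = [
        (total_energy(seq, self_energy, pair_energy), "".join(seq))
        for seq in itertools.product(*choices)
    ]
    return sorted(scored)


def monte_carlo_search(choices, self_energy, pair_energy,
                       steps: int = 2000, kt: float = 1.0, seed: int = 0) -> Tuple[float, str]:
    """Stochastic search: Metropolis sampling, returning the best state visited."""
    rng = random.Random(seed)
    current = [rng.choice(options) for options in choices]
    e_current = total_energy(current, self_energy, pair_energy)
    best = (e_current, "".join(current))
    for _ in range(steps):
        position = rng.randrange(len(choices))
        trial = list(current)
        trial[position] = rng.choice(choices[position])
        e_trial = total_energy(trial, self_energy, pair_energy)
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if e_trial <= e_current or rng.random() < math.exp(-(e_trial - e_current) / kt):
            current, e_current = trial, e_trial
            if e_current < best[0]:
                best = (e_current, "".join(current))
    return best


if __name__ == "__main__":
    # Toy three-position design problem, for illustration only.
    choices = [["A", "G"], ["L", "V"], ["K", "E"]]
    self_energy = [{"A": -1.0, "G": 0.0}, {"L": -2.0, "V": -1.5}, {"K": 0.0, "E": -0.5}]
    pair_energy = {(1, "V", 2, "K"): -2.0}
    print(exhaustive_search(choices, self_energy, pair_energy)[:3])
    print(monte_carlo_search(choices, self_energy, pair_energy))
```

Exhaustive enumeration is only feasible for small problems such as this one; for realistic numbers of positions and rotamers the stochastic search, or a deterministic pruning method, takes its place.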

Individual protein design methods are distinguished 
from each other mainly in terms of the forcefield and sampling algorithm. 
However, scoring functions and sampling algorithms in these protein 
design methods may optionally be used for a structure-based evaluation 
of the sequences from the hit and/or hit variant library. 

For example, because it is an important interaction for scoring
correct packing inside the core of proteins, the van der Waals (vdw)
interaction was used to design protein core sequences by testing
allowed rotamer sequences by enumeration (Ponder JW, Richards FM
(1987) J Mol Biol 193, 775-791). A group of sequences can be selected
under a potential function using simulated evolution with a stochastic
algorithm; the ranking order of the energies of the selected sequences
for residues in the hydrophobic cores of proteins correlates well with
their biological activities (Hellinga HW, Richards FM (1994) PNAS 91,
5803-5807).

Similar approaches were also used to design proteins using
stochastic algorithms (Desjarlais J, Handel T (1995) Protein Science 4,
2006-2018; Kono H, Doi J (1994) Proteins 19, 244-255). The effect of
the potential function on the designed sequences of a target scaffold
has been evaluated by including van der Waals, electrostatics, and
surface-dependent semiempirical environmental free energy terms, or
combinations of these terms, in an automatic protein design method that
keeps the composition of the amino acid sequence unchanged. It was
shown that each additional term of the energy function progressively
increases the performance of the designed sequences, with vdw for
packing, electrostatics for folding specificity, and the environmental
solvation term for burial of the hydrophobic residues and for exposure
of the hydrophilic residues (Koehl P, Levitt M (1999) J Mol Biol 293,
1161-1181).
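The idea of adding energy terms one at a time can be sketched as below. This is a minimal illustration under stated assumptions, not the potential function of the cited work: the 12-6 Lennard-Jones and Coulomb forms are standard textbook expressions, whereas the environment term, the default parameters (epsilon, sigma, dielectric) and the input layout of score are hypothetical choices made for this example.

```python
import math

# Approximate conversion factor so that charges in units of e and distances
# in angstrom give energies in kcal/mol.
COULOMB_CONSTANT = 332.06


def lennard_jones(r, epsilon=0.1, sigma=3.5):
    """12-6 van der Waals term for one atom pair at distance r (angstrom)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def coulomb(r, q1, q2, dielectric=4.0):
    """Electrostatic term for two point charges at distance r."""
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * r)


def environment(hydrophobicity, buried_fraction):
    """Toy environmental term (assumption): favourable when hydrophobic groups
    (hydrophobicity > 0) are buried and hydrophilic groups (< 0) are exposed."""
    return -hydrophobicity * (2.0 * buried_fraction - 1.0)


def score(pairs, residues, terms=("vdw", "elec", "env")):
    """Sum the selected terms.

    pairs:    list of (distance, charge1, charge2) tuples
    residues: list of (hydrophobicity, buried_fraction) tuples
    """
    total = 0.0
    if "vdw" in terms:
        total += sum(lennard_jones(r) for r, _, _ in pairs)
    if "elec" in terms:
        total += sum(coulomb(r, q1, q2) for r, q1, q2 in pairs)
    if "env" in terms:
        total += sum(environment(h, b) for h, b in residues)
    return total


pairs = [(3.9, 0.3, -0.3), (5.0, 0.5, 0.5)]
residues = [(1.0, 0.9), (-1.0, 0.1)]
for terms in (("vdw",), ("vdw", "elec"), ("vdw", "elec", "env")):
    print(terms, round(score(pairs, residues, terms), 3))
```

Passing a shorter terms tuple reproduces the progressive comparison described above, in which the same candidate is scored with vdw only, then with electrostatics added, then with the environmental term added.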

The self-consistent mean field approach was used to sample the
energy surface in order to find the optimal solution (Delarue M, Koehl
P (1997) Pac. Symp. Biocomput. 109-121; Koehl P, Delarue M (1994) J.
Mol. Biol. 239, 249-275; Koehl P, Delarue M (1995) Nat. Struct. Biol.
2, 163-170; Koehl P, Delarue M (1996) Curr. Opin. Struct. Biol. 6,
222-226; Lee J (1994) J. Mol. Biol. 236, 918-939; Vasquez (1995)
Biopolymers 36, 53-70). A combination of terms from a Molecular
Mechanics (MM) forcefield, a knowledge-based statistical forcefield and
other empirical corrections has also been used to design protein
sequences that are close to the native sequence of the target scaffold
(Kuhlman B, Baker D (2000) PNAS 97, 10383-10388). Structure-based
thermodynamic terms were included in addition to steric repulsion in
protein core design (Jiang X, Farid H, Pistor E, Farid RS (2000)
Protein Science 9, 403-416). Knowledge-based potentials have also been
used to design proteins (Rossi A, Micheletti C, Seno F, Maritan A
(2001) Biophysical Journal 80, 480-490).
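A self-consistent mean field calculation of the kind referred to above can be sketched as follows. This is a simplified illustration, not the procedure of any cited reference: each position carries a probability vector over its rotamers, each rotamer feels the probability-weighted average interaction with all other positions, and the probabilities are updated with Boltzmann weights until they stop changing. The two-position system, the clash table PAIR, and the values of kT, damping and the iteration count are all hypothetical.

```python
import math

# Hypothetical toy system: 2 positions, 2 rotamers each (arbitrary energy units).
SELF_ENERGY = [[0.0, 1.0], [0.5, 0.0]]
# Rotamer 0 at position 0 clashes with rotamer 1 at position 1.
PAIR = {((0, 0), (1, 1)): 3.0}


def pair_energy(i, r, j, s):
    """Symmetric lookup of the pairwise term; unlisted pairs do not interact."""
    return PAIR.get(((i, r), (j, s)), PAIR.get(((j, s), (i, r)), 0.0))


def mean_field(kT=0.6, iterations=100, damping=0.5):
    """Iterate rotamer probabilities to self-consistency."""
    n = len(SELF_ENERGY)
    probs = [[1.0 / len(e)] * len(e) for e in SELF_ENERGY]  # uniform start
    for _ in range(iterations):
        updated = []
        for i in range(n):
            fields = []
            for r in range(len(SELF_ENERGY[i])):
                # Effective energy of rotamer r: self term plus the
                # probability-weighted interaction with every other position.
                e = SELF_ENERGY[i][r]
                for j in range(n):
                    if j != i:
                        e += sum(probs[j][s] * pair_energy(i, r, j, s)
                                 for s in range(len(SELF_ENERGY[j])))
                fields.append(e)
            e_min = min(fields)  # subtract the minimum for numerical stability
            weights = [math.exp(-(e - e_min) / kT) for e in fields]
            z = sum(weights)
            updated.append([w / z for w in weights])
        # Damped update to avoid oscillation between iterations.
        probs = [[damping * old + (1.0 - damping) * new
                  for old, new in zip(p_old, p_new)]
                 for p_old, p_new in zip(probs, updated)]
    return probs


probs = mean_field()
most_likely = [p.index(max(p)) for p in probs]
print(probs, most_likely)
```

The most probable rotamer at each position is then read off the converged probabilities. Mean field methods of this kind are approximate and may converge to a local solution, so the result is usually checked against another search method.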

Forcefields have also been optimized specifically for protein
design purposes. The energy function is decomposed into pairwise
functional forms that combine molecular mechanics energy terms with
specific solvation terms used for residues at the core, boundary and
surface positions; a dead-end elimination algorithm is used to sift
through the huge number of combinatorial rotameric sequences (Dahiyat
BI, Mayo SL (1996) Protein Science 5, 895-903). The stringency of the
force fields and the rigid inverse folding protocol with a fixed
backbone used in protein design have inevitably resulted in a
significant rate of false negatives: rejection of many sequences that
might be acceptable if a soft energy function or a flexible backbone
were allowed. Moreover, the energy function used for protein design is
quite different from forcefields such as Amber or Charmm that are
widely used and tested for studying protein folding or stability
(Gordon DB, Marshall SA, Mayo SL (1999) Curr Opin Struct Biol 9,
509-513). Caution should be exercised when comparing sequences designed
using a specific protocol with those from alternative methods, because
a direct comparison among them may not be possible due to the false
negative issues involved in protein design protocols.
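The pruning performed by dead-end elimination can be sketched as follows, assuming a pairwise-decomposable energy as described above. The sketch uses the simplest form of the elimination criterion: a rotamer is discarded when its best-case energy is still worse than the worst-case energy of some alternative rotamer at the same position. The toy energies and the function names are hypothetical and are not the implementation of the cited work.

```python
# Hypothetical toy system: SELF_ENERGY[i][r] is the self energy of rotamer r
# at position i (arbitrary units).
SELF_ENERGY = [[0, 5, 9], [1, 0, 8], [0, 7]]


def pair_energy(i, r, j, s):
    """Toy pairwise term: a penalty of 1 when two positions use the same rotamer index."""
    return 1 if r == s else 0


def dead_end_elimination():
    """Return, for each position, the set of rotamers that survive pruning."""
    n = len(SELF_ENERGY)
    alive = [set(range(len(e))) for e in SELF_ENERGY]

    def bound(i, r, pick):
        # Self energy plus the best (pick=min) or worst (pick=max) pair energy
        # with the surviving rotamers at every other position.
        return SELF_ENERGY[i][r] + sum(
            pick(pair_energy(i, r, j, s) for s in alive[j])
            for j in range(n) if j != i)

    changed = True
    while changed:  # repeat until a full pass removes nothing
        changed = False
        for i in range(n):
            for r in sorted(alive[i]):
                if len(alive[i]) == 1:
                    break
                best_case = bound(i, r, min)
                for t in sorted(alive[i]):
                    # r cannot be part of the global minimum if even its best
                    # case is worse than the worst case of alternative t.
                    if t != r and best_case > bound(i, t, max):
                        alive[i].discard(r)
                        changed = True
                        break
    return alive


alive = dead_end_elimination()
print(alive)
```

Because the comparison is strict, a rotamer belonging to the global minimum energy combination is never removed; the remaining combinations can then be enumerated or searched stochastically.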

The inventor believes that although a high false negative rate in
protein design is not a problem for designing proteins with no
restriction, it poses a serious problem for designing proteins for
pharmaceutical applications, because only a small, restricted region is
allowed to have altered sequences to improve protein function, such as
the CDRs in antibodies and a few positions in the framework regions.
Therefore, it is the accuracy rather than the speed of computational
screening that matters most for functional improvement, in order to
identify those few mutants in the targeted region.

These methods can be used to generate structure ensembles of
proteins in the native or unfolded states by molecular dynamics
calculations or other computational methods, which provides a more
accurate way to score a sequence and its variants based on the
ensemble averages of the energy functions (Kollman PA, Massova I, Reyes
C, Kuhn B, Huo SH, Chong LT, Lee M, Lee TS, Duan Y, Wang W, Donini O,
Cieplak P, Srinivasan P, Case DA, and Cheatham TE (2000) Acc. Chem.
Res. 33, 889-897). The ensemble averages calculated from ensemble
structures show better correlation with the corresponding experimental
measurements.
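The difference between scoring a single structure and scoring an ensemble can be sketched as below. The variant names, the two energy components per snapshot and every number are made-up illustrative values (assumptions), chosen only to show that a ranking based on one snapshot may differ from a ranking based on the ensemble average.

```python
from statistics import mean

# Hypothetical snapshots: each is (molecular mechanics energy, solvation energy)
# for one structure taken from an ensemble of a sequence variant.
SNAPSHOTS = {
    "variant_A": [(-10.0, -2.0), (-12.0, -1.0)],
    "variant_B": [(-20.0, 5.0), (-4.0, -1.0)],
}


def snapshot_energy(snapshot):
    """Energy function evaluated on one structure: here simply the sum of the components."""
    e_mm, g_solv = snapshot
    return e_mm + g_solv


def ensemble_average(snapshots):
    """Average the energy function over all structures in the ensemble."""
    return mean(snapshot_energy(s) for s in snapshots)


def rank_by_ensemble_average(snapshots_by_variant):
    """Return (variant, average energy) pairs, most favourable first."""
    averages = {name: ensemble_average(snaps)
                for name, snaps in snapshots_by_variant.items()}
    return sorted(averages.items(), key=lambda item: item[1])


ranked = rank_by_ensemble_average(SNAPSHOTS)
first_only = sorted(SNAPSHOTS, key=lambda name: snapshot_energy(SNAPSHOTS[name][0]))
print("ensemble ranking:", ranked)
print("single-snapshot ranking:", first_only)
```

In this toy data the first snapshot alone would favour variant_B, while the ensemble average favours variant_A.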

In a particular embodiment, standard MM terms have been combined
with solvation terms, including the electrostatic solvation term and
the solvent-accessible surface solvation term, with the electrostatic
solvation calculated using a continuum solvent model; these MM-PBSA or
MM-GBSA methods, together with the contribution from the conformational
entropy of the backbone and side chains, have shown good correlation
between experimental and calculated values of the free energy change
(Wang W, Kollman P (2001) JMB). Compared to other scoring functions
used in protein and drug design, MM-PBSA or MM-GBSA is a better
physical model for scoring and would handle various problems on a
uniform basis, although it is computationally expensive because
multiple trajectories from molecular dynamics simulations in explicit
water are required to calculate the ensemble averages for the system.
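One common way of writing the combination of terms described above is given below as a sketch. The symbols are assumptions introduced here for illustration: angle brackets denote averages over the simulated ensemble, E_MM is the molecular mechanics energy, G_PB/GB is the electrostatic solvation free energy from a Poisson-Boltzmann or generalized Born continuum model, G_SA is the solvent-accessible surface term, and S_conf is the conformational entropy at temperature T.

```latex
\begin{align}
  \langle G \rangle &= \langle E_{\mathrm{MM}} \rangle
      + \langle G_{\mathrm{PB/GB}} \rangle
      + \langle G_{\mathrm{SA}} \rangle
      - T \langle S_{\mathrm{conf}} \rangle , \\
  \Delta G &= \langle G \rangle_{\mathrm{variant}}
      - \langle G \rangle_{\mathrm{reference}} .
\end{align}
```

The second line expresses the free energy change of a sequence variant relative to a reference sequence evaluated with the same protocol.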
CLAIMS

What is claimed is:

1. A method for constructing a library of recombinant antibodies,
comprising the steps of:

clustering variable regions of a collection of antibodies having
known 3D structures into at least two families of structural ensembles,
each family of structural ensemble comprising at least two different
antibody sequences but with substantially identical main chain
conformations;

selecting a representative structural template from each family of
structural ensemble;

profiling a tester polypeptide sequence onto the representative
structural template within each family of structural ensemble; and

selecting the tester antibody sequence that is compatible with the
structural constraints of the representative structural template.

2. The method of claim 1, wherein the collection of antibodies
includes antibodies or immunoglobulins collected in a protein database.

3. The method of claim 2, wherein the protein database is selected
from the group consisting of the Protein Data Bank of Brookhaven
National Laboratory, GenBank at the National Institutes of Health, and
the Swiss-Prot protein sequence database.

4. The method of claim 1, wherein the collection of antibodies
having known 3D structures includes antibodies having resolved X-ray
crystal structures, NMR structures or 3D structures based on
structural modeling.

5. The method of claim 1, wherein the variable regions of the
collection of antibodies are the full length heavy chain or light chain
variable regions or specific portions of the heavy chain or light chain
variable region selected from the group consisting of CDR, FR, and a
combination thereof.

6. The method of claim 5, wherein the CDR is CDR1, CDR2, or
CDR3 of an antibody.

7. The method of claim 5, wherein the FR is FR1, FR2, FR3, or FR4
of an antibody.

8. The method of claim 1, wherein the clustering step includes
clustering the collection of antibodies such that the root mean square
difference of the main chain conformations of antibody sequences in
each family of the structural ensemble is less than 4 Å.

9. The method of claim 1, wherein the clustering step includes
clustering the collection of antibodies such that the root mean square
difference of the main chain conformations of antibody sequences in
each family of the structural ensemble is less than 3 Å.

10. The method of claim 1, wherein the clustering step includes
clustering the collection of antibodies such that the root mean square
difference of the main chain conformations of antibody sequences in
each family of the structural ensemble is less than 2 Å.

11. The method of claim 1, wherein the clustering step includes
clustering the collection of antibodies such that the root mean square
difference of the main chain conformations of antibody sequences in
each family of the structural ensemble is between about 0.1-4.0 Å.

12. The method of claim 1, wherein the clustering step includes
clustering the collection of antibodies such that the Z-score of the main
chain conformations of antibody sequences in each family of the
structural ensemble is more than 2.

13. The method of claim 1, wherein the clustering step includes
clustering the collection of antibodies such that the Z-score of the main
chain conformations of antibody sequences in each family of the
structural ensemble is more than 3.

14. The method of claim 1, wherein the clustering step includes
clustering the collection of antibodies such that the Z-score of the main
chain conformations of antibody sequences in each family of the
structural ensemble is more than 4.

15. The method of claim 1, wherein the clustering step includes
clustering the collection of antibodies such that the Z-score of the main
chain conformations of antibody sequences in each family of the
structural ensemble is between about 2-8.

16. The method of claim 1, wherein the clustering step is
implemented by an algorithm selected from the group consisting of CE,
Monte Carlo and 3D clustering algorithms.

17. The method of claim 1, wherein the profiling step includes
reverse threading the tester polypeptide sequence onto the
representative structural template within each family of structural
ensemble.

18. The method of claim 1, wherein the profiling step is implemented
by a multiple sequence alignment algorithm.

19. The method of claim 18, wherein the multiple sequence
alignment algorithm is profile HMM algorithm or PSI-BLAST.

20. The method of claim 1, wherein the representative structural
template is adopted by a CDR region, and the profiling step includes
profiling the tester polypeptide sequence that is a variable region of a
human or non-human antibody onto the representative structural
template within each family of structural ensemble.

21. The method of claim 1, wherein the representative structural
template is adopted by a FR region, and the profiling step includes
profiling the tester polypeptide sequence that is a variable region of a
human antibody onto the representative structural template within
each family of structural ensemble.

22. The method of claim 21, wherein the tester polypeptide sequence
is a variable region of human germline antibody sequence.

23. The method of claim 1, wherein the tester polypeptide sequence
is the sequence or a segment sequence of an expressed protein.

24. The method of claim 1, wherein the tester polypeptide sequence
is a region of an antibody.

25. The method of claim 24, wherein the antibody is a human
antibody.

26. The method of claim 1, wherein the tester polypeptide sequence
is a region of a human germline antibody sequence.

27. The method of claim 1, wherein the selecting step includes
selecting the tester polypeptide sequence by using an energy scoring
function selected from the group consisting of electrostatic interactions,
van der Waals interactions, electrostatic solvation energy, solvent-
accessible surface solvation energy, and conformational entropy.

28. The method of claim 1, wherein the selecting step includes
selecting the tester polypeptide sequence by using a scoring function
incorporating a forcefield selected from the group consisting of the
Amber forcefield, Charmm forcefield, the Discover cvff forcefields, the
ECEPP forcefields, the GROMOS forcefields, the OPLS forcefields, the
MMFF94 forcefield, the Tripos forcefield, the MM3 forcefield, the
Dreiding forcefield, and UNRES forcefield, and other knowledge-based
statistical forcefield (mean field) and structure-based thermodynamic
potential functions.

29. The method of claim 1, further comprising the steps of:

building an amino acid positional variant profile of the selected
tester polypeptide sequences;

filtering out the variants with occurrence frequency lower than 3;
and

combining the remaining variants to produce a combinatorial
library of antibody sequences.

30. The method of claim 29, wherein the filtering step includes
filtering out the variants with occurrence frequency lower than 5.

31. The method of claim 1, further comprising the following:

introducing the DNA segment encoding the selected tester
polypeptide into cells of a host organism;

expressing the DNA segment in the host cells such that a
recombinant antibody containing the selected polypeptide sequence is
produced in the cells of the host organism; and

selecting the recombinant antibody that binds to a target antigen
with affinity higher than 10^6 M^-1.

32. The method of claim 31, wherein the recombinant antibody is a
fully assembled antibody, a Fab fragment, an Fv fragment, or a single
chain antibody.

33. The method of claim 31, wherein the host organism is selected
from the group consisting of bacteria, yeast, plant, insect, and mammal.

34. The method of claim 31, wherein the target antigen is a small
molecule, protein, peptide, nucleic acid or polycarbohydrate.



35. A method of constructing a library of recombinant antibodies
based on a target structural template, comprising the steps of:

providing a target structural template of a variable region of one
or more antibodies;

profiling a tester polypeptide sequence onto the target structural
template; and

selecting the tester polypeptide sequence that is structurally
compatible with the target structural template.

36. The method of claim 35, wherein the target structural template is
a 3D structure of a heavy chain or light chain variable region of an
antibody.

37. The method of claim 36, wherein the heavy chain or light chain
variable region of an antibody is a CDR, a FR or a combination thereof.

38. The method of claim 35, wherein the target structural template is
a 3D structural ensemble of heavy chain or light chain variable regions
of at least two different antibodies.

39. The method of claim 38, wherein the heavy chain or light chain
variable regions are CDRs, FRs or combinations thereof.

40. The method of claim 35, wherein the profiling step includes
reverse threading the tester polypeptide sequence onto the target
structural template.

41. The method of claim 35, wherein the profiling step is
implemented by a multiple sequence alignment algorithm.

42. The method of claim 41, wherein the multiple sequence
alignment algorithm is profile HMM algorithm or PSI-BLAST.

43. The method of claim 35, wherein the target structural template is
adopted by a CDR region, and the profiling step includes profiling the
tester polypeptide sequence that is a variable region of a human or non-
human antibody onto the representative structural template within
each family of structural ensemble.

44. The method of claim 35, wherein the target structural template is
adopted by a FR region, and the profiling step includes profiling the
tester polypeptide sequence that is a variable region of a human
antibody onto the representative structural template within each family
of structural ensemble.

45. The method of claim 44, wherein the tester polypeptide sequence
is a variable region of human germline antibody sequence.

46. The method of claim 35, wherein the tester polypeptide sequence
is the sequence or a segment sequence of an expressed protein.

47. The method of claim 35, wherein the tester polypeptide sequence
is a region of an antibody.

48. The method of claim 35, wherein the antibody is a human
antibody.

49. The method of claim 35, wherein the tester polypeptide sequence
is a region of a human germline antibody sequence.

50. The method of claim 35, wherein the selecting step includes
selecting the tester polypeptide sequence by using an energy scoring
function selected from the group consisting of electrostatic interactions,
van der Waals interactions, electrostatic solvation energy, solvent-
accessible surface solvation energy, and conformational entropy.

51. The method of claim 35, wherein the selecting step includes
selecting the tester polypeptide sequence by using a scoring function
incorporating a forcefield selected from the group consisting of the
Amber forcefield, Charmm forcefield, the Discover cvff forcefields, the
ECEPP forcefields, the GROMOS forcefields, the OPLS forcefields, the
MMFF94 forcefield, the Tripos forcefield, the MM3 forcefield, the
Dreiding forcefield, and UNRES forcefield, and other knowledge-based
statistical forcefield (mean field) and structure-based thermodynamic
potential functions.

52. The method of claim 35, further comprising the steps of:

building an amino acid positional variant profile of the selected
tester polypeptide sequences;

filtering out the variants with occurrence frequency lower than 3;
and

combining the remaining variants to produce a combinatorial
library of antibody sequences.

53. The method of claim 52, wherein the filtering step includes
filtering out the variants with occurrence frequency lower than 5.

54. The method of claim 35, further comprising the following:

introducing the DNA segment encoding the selected tester
polypeptide into cells of a host organism;

expressing the DNA segment in the host cells such that a
recombinant antibody containing the selected polypeptide sequence is
produced in the cells of the host organism; and

selecting the recombinant antibody that binds to a target antigen
with affinity higher than 10^6 M^-1.

55. The method of claim 54, wherein the recombinant antibody is a
fully assembled antibody, a Fab fragment, an Fv fragment, or a single
chain antibody.

56. The method of claim 54, wherein the host organism is selected
from the group consisting of bacteria, yeast, plant, insect, and mammal.

57. The method of claim 54, wherein the target antigen is a small
molecule, protein, peptide, nucleic acid or polycarbohydrate.

58. A method for constructing a library of recombinant antibodies,
comprising the steps of:

providing a target sequence of a heavy chain or light chain
variable region of a target antibody;

aligning the target sequence with a tester polypeptide sequence;
and

selecting the tester polypeptide sequence that has at least 15%
sequence homology with the target sequence.

59. The method of claim 58, wherein the target sequence is the full
length heavy chain or light chain variable region of the target antibody
or specific portions of the variable regions of the target antibody
selected from the group consisting of CDR, FR, and a combination
thereof.

60. The method of claim 59, wherein the CDR is CDR1, CDR2, or
CDR3 of the target antibody.

61. The method of claim 59, wherein the FR is FR1, FR2, FR3, or
FR4 of the target antibody.

62. The method of claim 58, wherein the aligning step includes
aligning the target sequence with the polypeptide segment of the tester
polypeptide sequence by using a sequence alignment algorithm.

63. The method of claim 62, wherein the sequence alignment
algorithm is selected from the group consisting of BLAST, PSI-BLAST,
profile HMM, and COBLATH.

64. The method of claim 58, wherein the target sequence is a CDR
region of the target antibody, and the alignment step includes aligning
the tester polypeptide sequence that is the sequence or segment
sequence of an expressed protein with the target sequence.

65. The method of claim 58, wherein the target sequence is a FR
region of the target antibody, and the alignment step includes aligning
the tester polypeptide sequence that is the sequence or segment
sequence of a human antibody protein with the target sequence.

66. The method of claim 58, wherein the selecting step includes
selecting the tester polypeptide sequence that has at least 25%
sequence homology with the target sequence.

67. The method of claim 58, wherein the selecting step includes
selecting the tester polypeptide sequence that has at least 35%
sequence homology with the target sequence.

68. The method of claim 58, wherein the selecting step includes
selecting the tester polypeptide sequence that has at least 35%
sequence homology with the target sequence.

69. The method of claim 58, further comprising the steps of:

building an amino acid positional variant profile of the selected
tester polypeptide sequences;

filtering out the variants with occurrence frequency lower than 3;
and

combining the remaining variants to produce a combinatorial
library of antibody sequences.

70. The method of claim 58, wherein the filtering step includes
filtering out the variants with occurrence frequency lower than 5.

71. The method of claim 58, further comprising the following:

introducing the DNA segment encoding the selected tester
polypeptide into cells of a host organism;

expressing the DNA segment in the host cells such that a
recombinant antibody containing the selected polypeptide sequence is
produced in the cells of the host organism; and

selecting the recombinant antibody that binds to a target antigen
with affinity higher than 10^6 M^-1.

72. The method of claim 51, wherein the recombinant antibody is a
fully assembled antibody, a Fab fragment, an Fv fragment, or a single
chain antibody.

FIGURE 2



>1DGX:L IMMUNOGLOBULIN VL DOMAIN
DIQMTQSPSSLSASVGDRVTITCRASQGISSYLAWYQQKPGKAPKLLIYAASSLQSGVPS
RFSGSGSGTDFTLTISSLQPEDFATYYCQQHYTTPPTFGQGTKVEIKR
>1DH4:L IMMUNOGLOBULIN VL KAPPA DOMAIN
DIVMTQSPLSLPVTPGEPASISCRSSQSLLHSNGYNYLDWYLQKPGQSPQLLIYLGSNRA
SGVPDRFSGSGSGTDFTLKISRVEAEDVGVYYCQQHYTTPPTFGQGTKVEIKR
>1DH5:L IMMUNOGLOBULIN VL KAPPA DOMAIN
DIVLTQSPATLSLSPGERATLSCRASQSVSSSYLAWYQQKPGQAPRLLIYGASSRATGVP
ARFSGSGSGTDFTLTISSLEPEDFAVYYCQQHYTTPPTFGQGTKVEIKR
>1DH6:L IMMUNOGLOBULIN VL KAPPA DOMAIN
DIVMTQSPDSLAVSLGERATINCRSSQSVLYSSNNKNYLAWYQQKPGQPPKLLIYWASTR
ESGVPDRFSGSGSGTDFTLTISSLQAEDVAVYYCQQHYTTPPTFGQGTKVEIKR
>1DH7:L IMMUNOGLOBULIN VL LAMBDA DOMAIN
QSVLTQPPSVSGAPGQRVTISCSGSSSNIGSNYVSWYQQLPGTAPKLLIYDNNQRPSGVP
DRFSGSKSGTSASLAITGLQSEDEADYYCQQHYTTPPVFGGGTKLTVLG
>1DH8:L IMMUNOGLOBULIN VL LAMBDA DOMAIN
QSALTQPASVSGSPGQSITISCTGTSSDVGGYNYVSWYQQHPGKAPKLMIYDVSNRPSGV
SNRFSGSKSGNTASLTISGLQAEDEADYYCQQHYTTPPVFGGGTKLTVLG
>1DH9:L IMMUNOGLOBULIN VL LAMBDA DOMAIN
SYELTQPPSVSVAPGQTARISCSGDALGDKYASWYQQKPGQAPVLVIYDDSDRPSGIPER
FSGSNSGNTATLTISGTQAEDEADYYCQQHYTTPPVFGGGTKLTVLG



>1DHA:H IMMUNOGLOBULIN HEAVY CHAIN VARIABLE DOMAIN
QVQLVQSGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGLEWMGGIIPIFGTANY
AQKFQGRVTITADESTSTAYMELSSLRSEDTAVYYCARWGGDGFYAMDYWGQGTLVTVSS
>1DHO:H IMMUNOGLOBULIN HEAVY CHAIN VARIABLE DOMAIN
QVQLVQSGAEVKKPGASVKVSCKASGYTFTSYYMHWVRQAPGQGLEWMGWINPNSGGTNY
AQKFQGRVTMTRDKSSSTAYMELSSLRSEDTAVYYCARWGGDGFYAMDYWGQGTLVTVSS
>1DHQ:H IMMUNOGLOBULIN HEAVY CHAIN VARIABLE DOMAIN
QVQLKESGPALVKPTQTLTLTCTFSGFSLSTSGVGVGWIRQPPGKALEWLALIDWDDDKY
YSTSLKTRLTISKDTSKNQVVLTMTNMDPVDTATYYCARWGGDGFYAMDYWGQGTLVTVS
S

>1DHU:H IMMUNOGLOBULIN HEAVY CHAIN VARIABLE DOMAIN
EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSAISGSGGSTYY
ADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCARWGGDGFYAMDYWGQGTLVTVSS
>1DHV:H IMMUNOGLOBULIN HEAVY CHAIN VARIABLE DOMAIN
QVQLQESGPGLVKPSETLSLTCTVSGGSISSYYWSWIRQPPGKGLEWIGYIYYSGSTNYN
PSLKSRVTISVDTSKNQFSLKLSSVTAADTAVYYCARWGGDGFYAMDYWGQGTLVTVSS
>1DHW:H IMMUNOGLOBULIN HEAVY CHAIN VARIABLE DOMAIN
EVQLVQSGAEVKKPGESLKISCKGSGYSFTSYWIGWVRQMPGKGLEWMGIIYPGDSDTRY
SPSFQGQVTISADKSISTAYLQWSSLKASDTAMYYCARWGGDGFYAMDYWGQGTLVTVSS
>1DHZ:H IMMUNOGLOBULIN HEAVY CHAIN VARIABLE DOMAIN
QVQLQQSGPGLVKPSQTLSLTCAISGDSVSSNSAAWNWIRQSPGRGLEWLGRTYYRSKWY
NDYAVSVKSRITINPDTSKNQFSLQLNSVTPEDTAVYYCARWGGDGFYAMDYWGQGTLVT
VSS

FIGURE 11



SeqU' 



and the Selected Human Antibody Germline Sequence 

1DHA>VHGIi1.8...
Length = 98

Score = 502 (176.7 bits), Expect = 7.9e-52, P = 7.9e-52
Identities = 98/98 (100%), Positives = 98/98 (100%)

Query:  1 QVQLVQSGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGLEWMGGIIPIFGTANY 60
          QVQLVQSGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGLEWMGGIIPIFGTANY
Sbjct:  1 QVQLVQSGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGLEWMGGIIPIFGTANY 60

Query: 61 AQKFQGRVTITADESTSTAYMELSSLRSEDTAVYYCAR 98
          AQKFQGRVTITADESTSTAYMELSSLRSEDTAVYYCAR
Sbjct: 61 AQKFQGRVTITADESTSTAYMELSSLRSEDTAVYYCAR 98

1DHW>DP-73/V5-51...
Length = 98

Score = 523 (184.1 bits), Expect = 8.3e-55, P = 8.3e-55
Identities = 98/98 (100%), Positives = 98/98 (100%)

Query:  1 EVQLVQSGAEVKKPGESLKISCKGSGYSFTSYWIGWVRQMPGKGLEWMGIIYPGDSDTRY 60
          EVQLVQSGAEVKKPGESLKISCKGSGYSFTSYWIGWVRQMPGKGLEWMGIIYPGDSDTRY
Sbjct:  1 EVQLVQSGAEVKKPGESLKISCKGSGYSFTSYWIGWVRQMPGKGLEWMGIIYPGDSDTRY 60

Query: 61 SPSFQGQVTISADKSISTAYLQWSSLKASDTAMYYCAR 98
          SPSFQGQVTISADKSISTAYLQWSSLKASDTAMYYCAR
Sbjct: 61 SPSFQGQVTISADKSISTAYLQWSSLKASDTAMYYCAR 98

1DHO>DP-75/VI-2...
Length = 98

Score = 498 (175.3 bits), Expect = 2.1e-51, P = 2.1e-51
Identities = 93/98 (94%), Positives = 94/98 (95%)

Query:  1 QVQLVQSGAEVKKPGASVKVSCKASGYTFTSYYMHWVRQAPGQGLEWMGWINPNSGGTNY 60
          QVQLVQSGAEVKKPGASVKVSCKASGYTFT YYMHWVRQAPGQGLEWMGWINPNSGGTNY
Sbjct:  1 QVQLVQSGAEVKKPGASVKVSCKASGYTFTGYYMHWVRQAPGQGLEWMGWINPNSGGTNY 60

Query: 61 AQKFQGRVTMTRDKSSSTAYMELSSLRSEDTAVYYCAR 98
          AQKFQGRVTMTRD S STAYMELS LRS+DTAVYYCAR
Sbjct: 61 AQKFQGRVTMTRDTSISTAYMELSRLRSDDTAVYYCAR 98

1DHQ>S12-9/DP-27...
Length = 100

Score = 494 (173.9 bits), Expect = 1.9e-51, P = 1.9e-51
Identities = 94/99 (94%), Positives = 96/99 (96%)

Query:  1 QVQLKESGPALVKPTQTLTLTCTFSGFSLSTSGVGVGWIRQPPGKALEWLALIDWDDDKY 60
          QV L+ESGPALVKPTQTLTLTCTFSGFSLSTSG+ V WIRQPPGKALEWLALIDWDDDKY
Sbjct:  1 QVTLRESGPALVKPTQTLTLTCTFSGFSLSTSGMCVSWIRQPPGKALEWLALIDWDDDKY 60

Query: 61 YSTSLKTRLTISKDTSKNQVVLTMTNMDPVDTATYYCAR 99
          YSTSLKTRLTISKDTSKNQVVLTMTNMDPVDTATYYCAR
Sbjct: 61 YSTSLKTRLTISKDTSKNQVVLTMTNMDPVDTATYYCAR 99



1DHU>DP-47/V3-23...
Length = 98

Score = 500 (176.0 bits), Expect = 3.4e-51, P = 3.4e-51
Identities = 96/98 (97%), Positives = 98/98 (100%)

Query:  1 EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSAISGSGGSTYY 60
          EVQL+ESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSAISGSGGSTYY
Sbjct:  1 EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVSAISGSGGSTYY 60

Query: 61 ADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCAR 98
          ADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCA+
Sbjct: 61 ADSVKGRFTISRDNSKNTLYLQMNSLRAEDTAVYYCAK 98

1DHV>DP-71/3dl97d...
Length = 97

Score = 510 (179.5 bits), Expect = 1.3e-52, P = 1.3e-52
Identities = 97/97 (100%), Positives = 97/97 (100%)

Query:  1 QVQLQESGPGLVKPSETLSLTCTVSGGSISSYYWSWIRQPPGKGLEWIGYIYYSGSTNYN 60
          QVQLQESGPGLVKPSETLSLTCTVSGGSISSYYWSWIRQPPGKGLEWIGYIYYSGSTNYN
Sbjct:  1 QVQLQESGPGLVKPSETLSLTCTVSGGSISSYYWSWIRQPPGKGLEWIGYIYYSGSTNYN 60

Query: 61 PSLKSRVTISVDTSKNQFSLKLSSVTAADTAVYYCAR 97
          PSLKSRVTISVDTSKNQFSLKLSSVTAADTAVYYCAR
Sbjct: 61 PSLKSRVTISVDTSKNQFSLKLSSVTAADTAVYYCAR 97



1DHZ>DF-74/VH-VT...
Length = 101

Score = 530 (186.6 bits), Expect = 1.4e-56, P = 1.4e-56
Identities = 100/101 (99%), Positives = 100/101 (99%)

Query:  1 QVQLQQSGPGLVKPSQTLSLTCAISGDSVSSNSAAWNWIRQSPGRGLEWLGRTYYRSKWY 60
          QVQLQQSGPGLVKPSQTLSLTCAISGDSVSSNSAAWNWIRQSP RGLEWLGRTYYRSKWY
Sbjct:  1 QVQLQQSGPGLVKPSQTLSLTCAISGDSVSSNSAAWNWIRQSPSRGLEWLGRTYYRSKWY 60

Query: 61 NDYAVSVKSRITINPDTSKNQFSLQLNSVTPEDTAVYYCAR 101
          NDYAVSVKSRITINPDTSKNQFSLQLNSVTPEDTAVYYCAR
Sbjct: 61 NDYAVSVKSRITINPDTSKNQFSLQLNSVTPEDTAVYYCAR 101



1DGX>DPJC9/012...
Length = 95

Score = 466 (164.0 bits), Expect = 4.0e-48, P = 4.0e-48
Identities = 91/95 (95%), Positives = 92/95 (96%)

Query:  1 DIQMTQSPSSLSASVGDRVTITCRASQGISSYLAWYQQKPGKAPKLLIYAASSLQSGVPS 60
          DIQMTQSPSSLSASVGDRVTITCRASQ ISSYL WYQQKPGKAPKLLIYAASSLQSGVPS
Sbjct:  1 DIQMTQSPSSLSASVGDRVTITCRASQSISSYLNWYQQKPGKAPKLLIYAASSLQSGVPS 60

Query: 61 RFSGSGSGTDFTLTISSLQPEDFATYYCQQHYTTP 95
          RFSGSGSGTDFTLTISSLQPEDFATYYCQQ Y+TP
Sbjct: 61 RFSGSGSGTDFTLTISSLQPEDFATYYCQQSYSTP 95

1DH5>13K18...
Length = 74

Score = 367 (129.2 bits), Expect = 6.2e-38, P = 6.2e-38
Identities = 72/74 (97%), Positives = 73/74 (98%)

Query:  8 PATLSLSPGERATLSCRASQSVSSSYLAWYQQKPGQAPRLLIYGASSRATGVPARFSGSG 67
          PATLSLSPGERATLSCRASQSVSSSYLAWYQQKPGQAPRLLIYGASSRATG+PARFSGSG
Sbjct:  1 PATLSLSPGERATLSCRASQSVSSSYLAWYQQKPGQAPRLLIYGASSRATGIPARFSGSG 60

Query: 68 SGTDFTLTISSLEP 81
          SGTDFTLTIS LEP
Sbjct: 61 SGTDFTLTISRLEP 74



1DH4>DPK15/A19...
Length = 100

Score = 502 (176.7 bits), Expect = 4.5e-52, P = 4.5e-52
Identities = 96/100 (96%), Positives = 96/100 (96%)

Query:  1 DIVMTQSPLSLPVTPGEPASISCRSSQSLLHSNGYNYLDWYLQKPGQSPQLLIYLGSNRA 60
          DIVMTQSPLSLPVTPGEPASISCRSSQSLLHSNGYNYLDWYLQKPGQSPQLLIYLGSNRA
Sbjct:  1 DIVMTQSPLSLPVTPGEPASISCRSSQSLLHSNGYNYLDWYLQKPGQSPQLLIYLGSNRA 60

Query: 61 SGVPDRFSGSGSGTDFTLKISRVEAEDVGVYYCQQHYTTP 100
          SGVPDRFSGSGSGTDFTLKISRVEAEDVGVYYC Q   TP
Sbjct: 61 SGVPDRFSGSGSGTDFTLKISRVEAEDVGVYYCMQALQTP 100



1DH6>DPK24/VkXVKlob...
Length = 101

Score = 519 (182.7 bits), Expect = 2.0e-55, P = 2.0e-55
Identities = 98/101 (97%), Positives = 101/101 (100%)

Query:  1 DIVMTQSPDSLAVSLGERATINCRSSQSVLYSSNNKNYLAWYQQKPGQPPKLLIYWASTR 60
          DIVMTQSPDSLAVSLGERATINC+SSQSVLYSSNNKNYLAWYQQKPGQPPKLLIYWASTR
Sbjct:  1 DIVMTQSPDSLAVSLGERATINCKSSQSVLYSSNNKNYLAWYQQKPGQPPKLLIYWASTR 60

Query: 61 ESGVPDRFSGSGSGTDFTLTISSLQAEDVAVYYCQQHYTTP 101
          ESGVPDRFSGSGSGTDFTLTISSLQAEDVAVYYCQQ+Y+TP
Sbjct: 61 ESGVPDRFSGSGSGTDFTLTISSLQAEDVAVYYCQQYYSTP 101



1DH7>V1-17...
Length = 98

Score = 436 (153.5 bits), Expect = 2.1e-45, P = 2.1e-45
Identities = 83/89 (93%), Positives = 85/89 (95%)

Query:  1 QSVLTQPPSVSGAPGQRVTISCSGSSSNIGSNYVSWYQQLPGTAPKLLIYDNNQRPSGVP 60
          QSVLTQPPS SG PGQRVTISCSGSSSNIGSNYV WYQQLPGTAPKLLIY NNQRPSGVP
Sbjct:  1 QSVLTQPPSASGTPGQRVTISCSGSSSNIGSNYVYWYQQLPGTAPKLLIYSNNQRPSGVP 60

Query: 61 DRFSGSKSGTSASLAITGLQSEDEADYYC 89
          DRFSGSKSGTSASLAI+GL+SEDEADYYC
Sbjct: 61 DRFSGSKSGTSASLAISGLRSEDEADYYC 89



1DH8>2b2.4 00B5*
Length = 99

Score = 448 (157.7 bits), Expect = 1.6e-46, P = 1.6e-46
Identities = 86/90 (95%), Positives = 87/90 (96%)

Query:  1 QSALTQPASVSGSPGQSITISCTGTSSDVGGYNYVSWYQQHPGKAPKLMIYDVSNRPSGV 60
          QSALTQPASVSGSPGQSITISCTGTSSDVG YN VSWYQQHPGKAPKLMIY+VS RPSGV
Sbjct:  1 QSALTQPASVSGSPGQSITISCTGTSSDVGSYNLVSWYQQHPGKAPKLMIYEVSKRPSGV 60

Query: 61 SNRFSGSKSGNTASLTISGLQAEDEADYYC 90
          SNRFSGSKSGNTASLTISGLQAEDEADYYC
Sbjct: 61 SNRFSGSKSGNTASLTISGLQAEDEADYYC 90

1DH9>3r.9C5/DPL23..

Length = 95

Score = 421 (148.2 bits), Expect = 2.0e-43, P = 2.0e-43
Identities = 79/88 (89%), Positives = 82/88 (93%)

Query: 1  SYELTQPPSVSVAPGQTARISCSGDALGDKYASWYQQKPGQAPVLVIYDDSDRPSGIPER 60
          SYELTQPPSVSV+PGQTA I+CSGD LGDKYA WYQQKPGQ+PVLVIY DS RPSGIPER
Sbjct: 1  SYELTQPPSVSVSPGQTASITCSGDKLGDKYACWYQQKPGQSPVLVIYQDSKRPSGIPER 60

Query: 61 FSGSNSGNTATLTISGTQAEDEADYYCQ 88
          FSGSNSGNTATLTISGTQA DEADYYCQ
Sbjct: 61 FSGSNSGNTATLTISGTQAMDEADYYCQ 88
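
The "Identities" figures reported with the alignments above can be recomputed directly from the aligned rows. The following is a minimal sketch in Python, not part of the original figure; the function name `alignment_identity` is hypothetical, and the truncating integer percentage is an assumption inferred from the reported value of 79/88 (89%).

```python
def alignment_identity(query: str, subject: str) -> tuple[int, int, int]:
    """Return (identical positions, aligned length, truncated percent identity)."""
    if len(query) != len(subject):
        raise ValueError("aligned rows must have the same length")
    # A column counts as identical when both rows carry the same residue.
    identical = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    length = len(query)
    # Integer division truncates the percentage (assumed reporting convention).
    return identical, length, 100 * identical // length


# Query (1DH9) and subject (3r.9C5/DPL23) rows, both blocks concatenated.
QUERY = (
    "SYELTQPPSVSVAPGQTARISCSGDALGDKYASWYQQKPGQAPVLVIYDDSDRPSGIPER"
    "FSGSNSGNTATLTISGTQAEDEADYYCQ"
)
SUBJECT = (
    "SYELTQPPSVSVSPGQTASITCSGDKLGDKYACWYQQKPGQSPVLVIYQDSKRPSGIPER"
    "FSGSNSGNTATLTISGTQAMDEADYYCQ"
)

identical, length, percent = alignment_identity(QUERY, SUBJECT)
print(f"Identities = {identical}/{length} ({percent}%)")  # Identities = 79/88 (89%)
```

The "Positives" count additionally depends on a substitution matrix, which this sketch deliberately leaves out.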
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FIGURE 13A 

Structure-Based Clustering of the HuCAL Library

a) The structure-based clustering of the VH chains

(1DHA, 1DHO, 1DHW), 1DHQ, 1DHU, 1DHV, 1DHZ

b) The structure-based clustering of 7 VL chains

Kappa:  ((1DGX, 1DH5), (1DH4, 1DH6))
Lambda: (1DH7, 1DH8, 1DH9)

FIGURE 13B

Structure-Based Multiple Sequence Alignment for the Clustered
Structure Family Established by Using CE Algorithm

a) For 1DHA, 1DHO and 1DHW of 3 VH

1DHA:H  QVQLVQSGAEVKKPGSSVKVSCKASGGTFSS--YAISWVRQAPGQGLEWMGGIIP-IFGT
1DHO:H  QVQLVQSGAEVKKPGASVKVSCKASGYTFTS--YYMHWVRQAPGQGLEWMGWINP-NSGG
1DHW:H  EVQLVQSGAEVKKPGESLKISCKGSGYSFTS--YWIGWVRQMPGKGLEWMGIIYP-GDSD

1DHA:H  ANYAQKFQGRVTITADESTSTAYMELSSLRSEDTAVYYCARWGGDGFYAMDYWGQGTLVTVSS
1DHO:H  TNYAQKFQGRVTMTRDKSSSTAYMELSSLRSEDTAVYYCARWGGDGFYAMDYWGQGTLVTVSS
1DHW:H  TRYSPSFQGQVTISADKSISTAYLQWSSLKASDTAMYYCARWGGDGFYAMDYWGQGTLVTVSS

b) For 1DGX and 1DH5 of 2 kappa VL

1DGX:L  DIQMTQSPSSLSASVGDRVTITCRASQGI S SYLAWYQQKPGKAPKLLI
1DH5:L  DIVLTQSPATLSLSPGERATLSCRASQSV SS SYLAWYQQKPGQAPRLLI

1DGX:L  YAASSLQSGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQQHYTTPPTFGQGTKVEIK
1DH5:L  YGASSRATGVPARFSGSGSGTDFTLTISSLEPEDFAVYYCQQHYTTPPTFGQGTKVEIK

c) For 1DH4 and 1DH6 of 2 kappa VL

1DH4:L  DIVMTQSPLSLPVTPGEPASISCRSSQSL---LHSNGY-NYLDWYLQKPGQSPQLLI
1DH6:L  DIVMTQSPDSLAVSLGERATINCRSSQSV---LYSSNNKNYLAWYQQKPGQPPKLLI

1DH4:L  YLGSNRASGVPDRFSGSGSGTDFTLKISRVEAEDVGVYYCQQHYTTPPTFGQGTKVEIK
1DH6:L  YWASTRESGVPDRFSGSGSGTDFTLTISSLQAEDVAVYYCQQHYTTPPTFGQGTKVEIK

d) For 1DH7, 1DH8 and 1DH9 of 3 lambda VL

1DH7:L  QSVLTQPPS-VSGAPGQRVTISCSGSSSNIGS-NYVSWYQQLPGTAPKLLIYDNNQRPSG
1DH8:L  QSALTQPAS-VSGSPGQSITISCTGTSSDVGGYNYVSWYQQHPGKAPKLMIYDVSNRPSG
1DH9:L  SYELTQPPS-VSVAPGQTARISCSGDALGD---KYASWYQQKPGQAPVLVIYDDSDRPSG

1DH7:L  VPDRFSGSKSGTSASLAITGLQSEDEADYYCQQHYTTPPVFGGGTKLTVLG
1DH8:L  VSNRFSGSKSGNTASLTISGLQAEDEADYYCQQHYTTPPVFGGGTKLTVLG
1DH9:L  IPERFSGSNSGNTATLTISGTQAEDEADYYCQQHYTTPPVFGGGTKLTVLG

FIGURE 13C

Top Hits of Human Antibody Sequences that Adopt the Target
Structure or Scaffold of the Reclustered HuCAL Family by Using the
Profile HMM Method



1) For the clustered structure family for 1DHA, 1DHO and 1DHW of VH, the top hits are
listed below:

The sequences (1DHA:H, 1DHO:H and 1DHW:H) used to build the HMM are aligned with their top
hits from germline sequences using HMMER 2.1.1.

Sequence            Description    Score    E-value  N
DP-73/V5-51...+                    232.9    4.5e-68  1
VHVCW/COS-24+                      232.9    4.5e-68  1
DP-88/hv1051K...+                  232.1    8.1e-68  1
DP-10/hv1051...+                   231.4    1.3e-67  1
2M27/11M27...                      231.4    1.3e-67  1
VHGL1.8                            231.4    1.3e-67  1
VHVJB                              229.8    3.9e-67  1
VHGL1.2                            229.8    3.9e-67  1
RR.VH1.2                           229.4      5e-67  1
6M27                               229.3    5.6e-67  1
VH251Shen+                         229.1    6.2e-67  1
DP-75/VI-2...+                     229.0    6.8e-67  1

The top hits to the profile HMM, aligned to the original structure-based sequence
alignment, are shown below.

1DHA:H              QVQLVQSGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGLEWMGG
1DHO:H              QVQLVQSGAEVKKPGASVKVSCKASGYTFTSYYMHWVRQAPGQGLEWMGW
1DHW:H              EVQLVQSGAEVKKPGESLKISCKGSGYSFTSYWIGWVRQMPGKGLEWMGI

DP-73/V5-51...+     EVQLVQSGAEVKKPGESLKISCKGSGYSFTSYWIGWVRQMPGKGLEWMGI
VHVCW/COS-24+       EVQLVQSGAEVKKPGESLKISCKGSGYSFTSYWIGWVRQMPGKGLEWMGI
DP-88/hv1051K...+   QVQLVQSGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGLEWMGG
DP-10/hv1051...+    QVQLVQSGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGLEWMGG
2M27/11M27...       QVQLVQSGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGLEWMGG
VHGL1.8             QVQLVQSGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGLEWMGG
VHVJB               EVQLVQSGAEVKKPGESLKISCKGSGYSFTSYWIGWVRQMPGKGLEWMGI
VHGL1.2             QVQLVQSGAEVKKPGASVKVSCKASGYTFTSYYMHWVRQAPGQGLEWMGW
RR.VH1.2            QVQLVQSGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGLEWMGG
6M27                QVQLVQSGAEVKKPGSSVKVSCKASGGTFSSYAISWVRQAPGQGLEWMGR
VH251Shen+          EVQLVQSGAEVKKPGESLKISCKGSGYSFTSYWTGWVRQMPGKGLEWMGI
DP-75/VI-2...+      QVQLVQSGAEVKKPGASVKVSCKASGYTFTGYYMHWVRQAPGQGLEWMGW

1DHA:H              IIPIFGTANYAQKFQGRVTITADESTSTAYMELSSLRSEDTAVYYCAR
1DHO:H              INPNSGGTNYAQKFQGRVTMTRDKSSSTAYMELSSLRSEDTAVYYCAR
1DHW:H              IYPGDSDTRYSPSFQGQVTISADKSISTAYLQWSSLKASDTAMYYCAR

DP-73/V5-51...+     IYPGDSDTRYSPSFQGQVTISADKSISTAYLQWSSLKASDTAMYYCAR
VHVCW/COS-24+       IYPGDSDTRYSPSFQGQVTISADKSISTAYLQWSSLKASDTAMYYCAR
DP-88/hv1051K...+   IIPIFGTANYAQKFQGRVTITADKSTSTAYMELSSLRSEDTAVYYCAR
DP-10/hv1051...+    IIPIFGTANYAQKFQGRVTITADESTSTAYMELSSLRSEDTAVYYCAR
2M27/11M27...       IIPIFGTANYAQKFQGRVTITADESTSTAYMELSSLRSEDTAVYYCAR
VHGL1.8             IIPIFGTANYAQKFQGRVTITADESTSTAYMELSSLRSEDTAVYYCAR
VHVJB               IYPGDSDTRYSPSFQGQVTISADKPISTAYLQWSSLKASDTAMYYCAR
VHGL1.2             INPNSGGTNYAQKFQGRVTMTRDTSISTAYVELSRLRSDDTAVYYCAR
RR.VH1.2            IIPIFGTANYAQKFQGRVTITTDESTSTAYMELSSLRSEDTAVYYCAR
6M27                IIPILGTANYAQKFQGRVTITADESTSTAYMELSSLRSEDTAVYYCAR
VH251Shen+          IYPGDSDTRYSPSFQGQVTISADKSISTAYLQWSSLKASDTAMYYCAR
DP-75/VI-2...+      INPNSGGTNYAQKFQGRVTMTRDTSISTAYMELSRLRSDDTAVYYCAR


2) For 1DGX and 1DH5

Scores for complete sequences (score includes all domains):

Sequence            Description
DPK9/O12..
DPK23/L25.
V26B
V3b+
L10a+
DPK5/Vb*
DPK6/Vb+

Score    E-value  N
228.8     .5e-67
221.3    2.7e-65
221.3    2.7e-65
221.2      3e-65
218.9    1.4e-64
218.2    2.3e-64
218.2    2.3e-64
218.2    2.3e-64

3) For 1DH4 and 1DH6, the top hits from human germline database

Scores for complete sequences (score includes all domains):

Sequence            Description    Score    E-value  N
DPK24/VkIVKlob...+                 251.0    3.1e-74
DPK15/A19...+                      236.0    9.8e-70
DPK13/O11..                        205.9    1.2e-60
DPK36/Chr22-4                      201.5    2.5e-59
DPK12/A2*                          199.1    1.3e-58
A2b/A2c+                           198.7    1.7e-58
DPK27/A29*                         196.8    6.3e-58
A18b*                              196.2    9.5e-58
DPK18/A17*                         196.0    1.1e-57
A13*                               194.8    2.6e-57
DPK19/A1+                          190.7    4.3e-56
DPK28/A18+                         189.8    8.3e-56
DPK16/A23*                         188.1    2.7e-55

4) For 1DH7, 1DH8 and 1DH9, the top hits from human germline database

Scores for complete sequences (score includes all domains):

Sequence            Description    Score    E-value  N
DPL12+                             174.8    5.7e-51
2e.2.2/V1-3*                       174.8    5.7e-51
2a2.272A12/DPL11....               174.3       e-51
2c.118D9/V1-2*                     171.6    5.1e-50
VL2.1 (IGLV2S1)+                   170.2    1.4e-49
2b2.400B5*                         169.9    1.7e-49
1V2046                             169.7    1.9e-49
2d.29D11/DPL13...+                 169.2    2.7e-49
1V216.21                           169.1    2.9e-49
DPL10/V1-7...+                     167.5      9e-49
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SEQUENCE LISTING 



<110> Abmaxis, Inc.
      Luo, Peizhi

<120> STRUCTURE-BASED CONSTRUCTION OF HUMAN ANTIBODY LIBRARY

<130> 26050-706

<150> US 60/284,407
<151> 2001-04-17

<160> 28

<170> PatentIn version 3.1



<210> 1
<211> 120
<212> PRT
<213> Artificial Sequence

<220>
<223> Human consensus antibody heavy chain variable region

<400> 1

Gln Val Gln Leu Val Gln Ser Gly Ala Glu Val Lys Lys Pro Gly Ser
1               5                   10                  15

Ser Val Lys Val Ser Cys Lys Ala Ser Gly Gly Thr Phe Ser Ser Tyr
            20                  25                  30

Ala Ile Ser Trp Val Arg Gln Ala Pro Gly Gln Gly Leu Glu Trp Met
        35                  40                  45

Gly Gly Ile Ile Pro Ile Phe Gly Thr Ala Asn Tyr Ala Gln Lys Phe
    50                  55                  60

Gln Gly Arg Val Thr Ile Thr Ala Asp Glu Ser Thr Ser Thr Ala Tyr
65                  70                  75                  80

Met Glu Leu Ser Ser Leu Arg Ser Glu Asp Thr Ala Val Tyr Tyr Cys
                85                  90                  95

Ala Arg Trp Gly Gly Asp Gly Phe Tyr Ala Met Asp Tyr Trp Gly Gln
            100                 105                 110

Gly Thr Leu Val Thr Val Ser Ser
        115                 120

<210> 2
<211> 120
<212> PRT
<213> Artificial Sequence

<220>
<223> Human consensus antibody heavy chain variable region

<400> 2

Gln Val Gln Leu Val Gln Ser Gly Ala Glu Val Lys Lys Pro Gly Ala
1               5                   10                  15

Ser Val Lys Val Ser Cys Lys Ala Ser Gly Tyr Thr Phe Thr Ser Tyr
            20                  25                  30

Tyr Met His Trp Val Arg Gln Ala Pro Gly Gln Gly Leu Glu Trp Met
        35                  40                  45

Gly Trp Ile Asn Pro Asn Ser Gly Gly Thr Asn Tyr Ala Gln Lys Phe
    50                  55                  60

Gln Gly Arg Val Thr Met Thr Arg Asp Lys Ser Ser Ser Thr Ala Tyr
65                  70                  75                  80

Met Glu Leu Ser Ser Leu Arg Ser Glu Asp Thr Ala Val Tyr Tyr Cys
                85                  90                  95

Ala Arg Trp Gly Gly Asp Gly Phe Tyr Ala Met Asp Tyr Trp Gly Gln
            100                 105                 110

Gly Thr Leu Val Thr Val Ser Ser
        115                 120
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<210> 3 

<211> 120
<212> PRT
<213> Artificial Sequence

<220>
<223> Human consensus antibody heavy chain variable region

<400> 3

Gln Val Gln Leu Lys Glu Ser Gly Pro Ala Leu Val Lys Pro Thr Gln
1               5                   10                  15

Thr Leu Thr Leu Thr Cys Thr Phe Ser Gly Phe Ser Leu Ser Thr Ser
            20                  25                  30

Gly Val Gly Val Gly Trp Ile Arg Gln Pro Pro Gly Lys Ala Leu Glu
        35                  40                  45

Trp Leu Ala Leu Ile Asp Trp Asp Asp Asp Lys Tyr Tyr Ser Thr Ser
    50                  55                  60

Leu Lys Thr Arg Leu Thr Ile Ser Lys Asp Thr Ser Lys Asn Gln Val
65                  70                  75                  80

Val Leu Thr Met Thr Asn Met Asp Pro Val Asp Thr Ala Thr Tyr Tyr
                85                  90                  95

Cys Ala Arg Trp Gly Gly Asp Gly Phe Tyr Ala Met Asp Tyr Trp Gly
            100                 105                 110

Gln Gly Thr Leu Val Thr Val Ser
        115                 120



<210> 4
<211> 120
<212> PRT
<213> Artificial Sequence

<220>
<223> Human consensus antibody heavy chain variable region

<400> 4

Glu Val Gln Leu Val Glu Ser Gly Gly Gly Leu Val Gln Pro Gly Gly
1               5                   10                  15

Ser Leu Arg Leu Ser Cys Ala Ala Ser Gly Phe Thr Phe Ser Ser Tyr
            20                  25                  30

Ala Met Ser Trp Val Arg Gln Ala Pro Gly Lys Gly Leu Glu Trp Val
        35                  40                  45

Ser Ala Ile Ser Gly Ser Gly Gly Ser Thr Tyr Tyr Ala Asp Ser Val
    50                  55                  60

Lys Gly Arg Phe Thr Ile Ser Arg Asp Asn Ser Lys Asn Thr Leu Tyr
65                  70                  75                  80

Leu Gln Met Asn Ser Leu Arg Ala Glu Asp Thr Ala Val Tyr Tyr Cys
                85                  90                  95

Ala Arg Trp Gly Gly Asp Gly Phe Tyr Ala Met Asp Tyr Trp Gly Gln
            100                 105                 110

Gly Thr Leu Val Thr Val Ser Ser
        115                 120

<210> 5
<211> 119
<212> PRT
<213> Artificial Sequence

<220>
<223> Human consensus antibody heavy chain variable region

<400> 5

Gln Val Gln Leu Gln Glu Ser Gly Pro Gly Leu Val Lys Pro Ser Glu
1               5                   10                  15

Thr Leu Ser Leu Thr Cys Thr Val Ser Gly Gly Ser Ile Ser Ser Tyr
            20                  25                  30

Tyr Trp Ser Trp Ile Arg Gln Pro Pro Gly Lys Gly Leu Glu Trp Ile
        35                  40                  45

Gly Tyr Ile Tyr Tyr Ser Gly Ser Thr Asn Tyr Asn Pro Ser Leu Lys
    50                  55                  60

Ser Arg Val Thr Ile Ser Val Asp Thr Ser Lys Asn Gln Phe Ser Leu
65                  70                  75                  80

Lys Leu Ser Ser Val Thr Ala Ala Asp Thr Ala Val Tyr Tyr Cys Ala
                85                  90                  95

Arg Trp Gly Gly Asp Gly Phe Tyr Ala Met Asp Tyr Trp Gly Gln Gly
            100                 105                 110

Thr Leu Val Thr Val Ser Ser
        115

<210> 6
<211> 120
<212> PRT
<213> Artificial Sequence

<220>
<223> Human consensus antibody heavy chain variable region

<400> 6

Glu Val Gln Leu Val Gln Ser Gly Ala Glu Val Lys Lys Pro Gly Glu
1               5                   10                  15

Ser Leu Lys Ile Ser Cys Lys Gly Ser Gly Tyr Ser Phe Thr Ser Tyr
            20                  25                  30

Trp Ile Gly Trp Val Arg Gln Met Pro Gly Lys Gly Leu Glu Trp Met
        35                  40                  45

Gly Ile Ile Tyr Pro Gly Asp Ser Asp Thr Arg Tyr Ser Pro Ser Phe
    50                  55                  60

Gln Gly Gln Val Thr Ile Ser Ala Asp Lys Ser Ile Ser Thr Ala Tyr
65                  70                  75                  80

Leu Gln Trp Ser Ser Leu Lys Ala Ser Asp Thr Ala Met Tyr Tyr Cys
                85                  90                  95

Ala Arg Trp Gly Gly Asp Gly Phe Tyr Ala Met Asp Tyr Trp Gly Gln
            100                 105                 110

Gly Thr Leu Val Thr Val Ser Ser
        115                 120



<210> 7
<211> 123
<212> PRT
<213> Artificial Sequence

<220>
<223> Human consensus antibody heavy chain variable region

<400> 7

Gln Val Gln Leu Gln Gln Ser Gly Pro Gly Leu Val Lys Pro Ser Gln
1               5                   10                  15

Thr Leu Ser Leu Thr Cys Ala Ile Ser Gly Asp Ser Val Ser Ser Asn
            20                  25                  30

Ser Ala Ala Trp Asn Trp Ile Arg Gln Ser Pro Gly Arg Gly Leu Glu
        35                  40                  45

Trp Leu Gly Arg Thr Tyr Tyr Arg Ser Lys Trp Tyr Asn Asp Tyr Ala
    50                  55                  60

Val Ser Val Lys Ser Arg Ile Thr Ile Asn Pro Asp Thr Ser Lys Asn
65                  70                  75                  80

Gln Phe Ser Leu Gln Leu Asn Ser Val Thr Pro Glu Asp Thr Ala Val
                85                  90                  95

Tyr Tyr Cys Ala Arg Trp Gly Gly Asp Gly Phe Tyr Ala Met Asp Tyr
            100                 105                 110

Trp Gly Gln Gly Thr Leu Val Thr Val Ser Ser
        115                 120

<210> 8
<211> 108
<212> PRT
<213> Artificial Sequence

<220>
<223> Human consensus antibody light chain variable region

<400> 8

Asp Ile Gln Met Thr Gln Ser Pro Ser Ser Leu Ser Ala Ser Val Gly
1               5                   10                  15

Asp Arg Val Thr Ile Thr Cys Arg Ala Ser Gln Gly Ile Ser Ser Tyr
            20                  25                  30

Leu Ala Trp Tyr Gln Gln Lys Pro Gly Lys Ala Pro Lys Leu Leu Ile
        35                  40                  45

Tyr Ala Ala Ser Ser Leu Gln Ser Gly Val Pro Ser Arg Phe Ser Gly
    50                  55                  60

Ser Gly Ser Gly Thr Asp Phe Thr Leu Thr Ile Ser Ser Leu Gln Pro
65                  70                  75                  80

Glu Asp Phe Ala Thr Tyr Tyr Cys Gln Gln His Tyr Thr Thr Pro Pro
                85                  90                  95

Thr Phe Gly Gln Gly Thr Lys Val Glu Ile Lys Arg
            100                 105



<210> 9
<211> 113
<212> PRT
<213> Artificial Sequence

<220>
<223> Human consensus antibody light chain variable region

<400> 9

Asp Ile Val Met Thr Gln Ser Pro Leu Ser Leu Pro Val Thr Pro Gly
1               5                   10                  15

Glu Pro Ala Ser Ile Ser Cys Arg Ser Ser Gln Ser Leu Leu His Ser
            20                  25                  30

Asn Gly Tyr Asn Tyr Leu Asp Trp Tyr Leu Gln Lys Pro Gly Gln Ser
        35                  40                  45

Pro Gln Leu Leu Ile Tyr Leu Gly Ser Asn Arg Ala Ser Gly Val Pro
    50                  55                  60

Asp Arg Phe Ser Gly Ser Gly Ser Gly Thr Asp Phe Thr Leu Lys Ile
65                  70                  75                  80

Ser Arg Val Glu Ala Glu Asp Val Gly Val Tyr Tyr Cys Gln Gln His
                85                  90                  95

Tyr Thr Thr Pro Pro Thr Phe Gly Gln Gly Thr Lys Val Glu Ile Lys
            100                 105                 110

Arg



<210> 10
<211> 109
<212> PRT
<213> Artificial Sequence

<220>
<223> Human consensus antibody light chain variable region

<400> 10

Asp Ile Val Leu Thr Gln Ser Pro Ala Thr Leu Ser Leu Ser Pro Gly
1               5                   10                  15

Glu Arg Ala Thr Leu Ser Cys Arg Ala Ser Gln Ser Val Ser Ser Ser
            20                  25                  30

Tyr Leu Ala Trp Tyr Gln Gln Lys Pro Gly Gln Ala Pro Arg Leu Leu
        35                  40                  45

Ile Tyr Gly Ala Ser Ser Arg Ala Thr Gly Val Pro Ala Arg Phe Ser
    50                  55                  60

Gly Ser Gly Ser Gly Thr Asp Phe Thr Leu Thr Ile Ser Ser Leu Glu
65                  70                  75                  80

Pro Glu Asp Phe Ala Val Tyr Tyr Cys Gln Gln His Tyr Thr Thr Pro
                85                  90                  95

Pro Thr Phe Gly Gln Gly Thr Lys Val Glu Ile Lys Arg
            100                 105

<210> 11
<211> 114
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<212> PRT 

<213> Artificial sequence 



Abmaxis.706.ST25.txt



<220>
<223> Human consensus antibody light chain variable region

<400> 11

Asp Ile Val Met Thr Gln Ser Pro Asp Ser Leu Ala Val Ser Leu Gly
1               5                   10                  15

Glu Arg Ala Thr Ile Asn Cys Arg Ser Ser Gln Ser Val Leu Tyr Ser
            20                  25                  30

Ser Asn Asn Lys Asn Tyr Leu Ala Trp Tyr Gln Gln Lys Pro Gly Gln
        35                  40                  45

Pro Pro Lys Leu Leu Ile Tyr Trp Ala Ser Thr Arg Glu Ser Gly Val
    50                  55                  60

Pro Asp Arg Phe Ser Gly Ser Gly Ser Gly Thr Asp Phe Thr Leu Thr
65                  70                  75                  80

Ile Ser Ser Leu Gln Ala Glu Asp Val Ala Val Tyr Tyr Cys Gln Gln
                85                  90                  95

His Tyr Thr Thr Pro Pro Thr Phe Gly Gln Gly Thr Lys Val Glu Ile
            100                 105                 110

Lys Arg



<210> 12
<211> 109
<212> PRT
<213> Artificial Sequence

<220>
<223> Human consensus antibody light chain variable region

<400> 12

Gln Ser Val Leu Thr Gln Pro Pro Ser Val Ser Gly Ala Pro Gly Gln
1               5                   10                  15

Arg Val Thr Ile Ser Cys Ser Gly Ser Ser Ser Asn Ile Gly Ser Asn
            20                  25                  30

Tyr Val Ser Trp Tyr Gln Gln Leu Pro Gly Thr Ala Pro Lys Leu Leu
        35                  40                  45

Ile Tyr Asp Asn Asn Gln Arg Pro Ser Gly Val Pro Asp Arg Phe Ser
    50                  55                  60

Gly Ser Lys Ser Gly Thr Ser Ala Ser Leu Ala Ile Thr Gly Leu Gln
65                  70                  75                  80

Ser Glu Asp Glu Ala Asp Tyr Tyr Cys Gln Gln His Tyr Thr Thr Pro
                85                  90                  95

Pro Val Phe Gly Gly Gly Thr Lys Leu Thr Val Leu Gly
            100                 105



<210> 13
<211> 110
<212> PRT
<213> Artificial Sequence

<220>
<223> Human consensus antibody light chain variable region

<400> 13

Gln Ser Ala Leu Thr Gln Pro Ala Ser Val Ser Gly Ser Pro Gly Gln
1               5                   10                  15

Ser Ile Thr Ile Ser Cys Thr Gly Thr Ser Ser Asp Val Gly Gly Tyr
            20                  25                  30

Asn Tyr Val Ser Trp Tyr Gln Gln His Pro Gly Lys Ala Pro Lys Leu
        35                  40                  45

Met Ile Tyr Asp Val Ser Asn Arg Pro Ser Gly Val Ser Asn Arg Phe
    50                  55                  60

Ser Gly Ser Lys Ser Gly Asn Thr Ala Ser Leu Thr Ile Ser Gly Leu
65                  70                  75                  80

Gln Ala Glu Asp Glu Ala Asp Tyr Tyr Cys Gln Gln His Tyr Thr Thr
                85                  90                  95

Pro Pro Val Phe Gly Gly Gly Thr Lys Leu Thr Val Leu Gly
            100                 105                 110



<210> 14
<211> 107
<212> PRT
<213> Artificial Sequence

<220>
<223> Human consensus antibody light chain variable region

<400> 14

Ser Tyr Glu Leu Thr Gln Pro Pro Ser Val Ser Val Ala Pro Gly Gln
1               5                   10                  15

Thr Ala Arg Ile Ser Cys Ser Gly Asp Ala Leu Gly Asp Lys Tyr Ala
            20                  25                  30

Ser Trp Tyr Gln Gln Lys Pro Gly Gln Ala Pro Val Leu Val Ile Tyr
        35                  40                  45

Asp Asp Ser Asp Arg Pro Ser Gly Ile Pro Glu Arg Phe Ser Gly Ser
    50                  55                  60

Asn Ser Gly Asn Thr Ala Thr Leu Thr Ile Ser Gly Thr Gln Ala Glu
65                  70                  75                  80

Asp Glu Ala Asp Tyr Tyr Cys Gln Gln His Tyr Thr Thr Pro Pro Val
                85                  90                  95

Phe Gly Gly Gly Thr Lys Leu Thr Val Leu Gly
            100                 105



<210> 15 

<211> 98
<212> PRT
<213> Homo sapiens



WO 02/084277 PCT/US02/12202 



<400> 15

Gln Val Gln Leu Val Gln Ser Gly Ala Glu Val Lys Lys Pro Gly Ser
1               5                   10                  15

Ser Val Lys Val Ser Cys Lys Ala Ser Gly Gly Thr Phe Ser Ser Tyr
            20                  25                  30

Ala Ile Ser Trp Val Arg Gln Ala Pro Gly Gln Gly Leu Glu Trp Met
        35                  40                  45

Gly Gly Ile Ile Pro Ile Phe Gly Thr Ala Asn Tyr Ala Gln Lys Phe
    50                  55                  60

Gln Gly Arg Val Thr Ile Thr Ala Asp Glu Ser Thr Ser Thr Ala Tyr
65                  70                  75                  80

Met Glu Leu Ser Ser Leu Arg Ser Glu Asp Thr Ala Val Tyr Tyr Cys
                85                  90                  95

Ala Arg



<210> 16
<211> 98
<212> PRT
<213> Homo sapiens

<400> 16

Glu Val Gln Leu Val Gln Ser Gly Ala Glu Val Lys Lys Pro Gly Glu
1               5                   10                  15

Ser Leu Lys Ile Ser Cys Lys Gly Ser Gly Tyr Ser Phe Thr Ser Tyr
            20                  25                  30

Trp Ile Gly Trp Val Arg Gln Met Pro Gly Lys Gly Leu Glu Trp Met
        35                  40                  45

Gly Ile Ile Tyr Pro Gly Asp Ser Asp Thr Arg Tyr Ser Pro Ser Phe
    50                  55                  60

Gln Gly Gln Val Thr Ile Ser Ala Asp Lys Ser Ile Ser Thr Ala Tyr
65                  70                  75                  80

Leu Gln Trp Ser Ser Leu Lys Ala Ser Asp Thr Ala Met Tyr Tyr Cys
                85                  90                  95

Ala Arg



<210> 17
<211> 98
<212> PRT
<213> Homo sapiens

<400> 17

Gln Val Gln Leu Val Gln Ser Gly Ala Glu Val Lys Lys Pro Gly Ala
1               5                   10                  15

Ser Val Lys Val Ser Cys Lys Ala Ser Gly Tyr Thr Phe Thr Gly Tyr
            20                  25                  30

Tyr Met His Trp Val Arg Gln Ala Pro Gly Gln Gly Leu Glu Trp Met
        35                  40                  45

Gly Trp Ile Asn Pro Asn Ser Gly Gly Thr Asn Tyr Ala Gln Lys Phe
    50                  55                  60

Gln Gly Arg Val Thr Met Thr Arg Asp Thr Ser Ile Ser Thr Ala Tyr
65                  70                  75                  80

Met Glu Leu Ser Arg Leu Arg Ser Asp Asp Thr Ala Val Tyr Tyr Cys
                85                  90                  95

Ala Arg



<210> 18
<211> 99
<212> PRT
<213> Homo sapiens

<400> 18

Gln Val Thr Leu Arg Glu Ser Gly Pro Ala Leu Val Lys Pro Thr Gln
1               5                   10                  15

Thr Leu Thr Leu Thr Cys Thr Phe Ser Gly Phe Ser Leu Ser Thr Ser
            20                  25                  30

Gly Met Cys Val Ser Trp Ile Arg Gln Pro Pro Gly Lys Ala Leu Glu
        35                  40                  45

Trp Leu Ala Leu Ile Asp Trp Asp Asp Asp Lys Tyr Tyr Ser Thr Ser
    50                  55                  60

Leu Lys Thr Arg Leu Thr Ile Ser Lys Asp Thr Ser Lys Asn Gln Val
65                  70                  75                  80

Val Leu Thr Met Thr Asn Met Asp Pro Val Asp Thr Ala Thr Tyr Tyr
                85                  90                  95

Cys Ala Arg



<210> 19
<211> 98
<212> PRT
<213> Homo sapiens

<400> 19

Glu Val Gln Leu Leu Glu Ser Gly Gly Gly Leu Val Gln Pro Gly Gly
1               5                   10                  15

Ser Leu Arg Leu Ser Cys Ala Ala Ser Gly Phe Thr Phe Ser Ser Tyr
            20                  25                  30

Ala Met Ser Trp Val Arg Gln Ala Pro Gly Lys Gly Leu Glu Trp Val
        35                  40                  45

Ser Ala Ile Ser Gly Ser Gly Gly Ser Thr Tyr Tyr Ala Asp Ser Val
    50                  55                  60

Lys Gly Arg Phe Thr Ile Ser Arg Asp Asn Ser Lys Asn Thr Leu Tyr
65                  70                  75                  80

Leu Gln Met Asn Ser Leu Arg Ala Glu Asp Thr Ala Val Tyr Tyr Cys
                85                  90                  95

Ala Lys
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<210> 20 

<211> 97
<212> PRT
<213> Homo sapiens

<400> 20

Gln Val Gln Leu Gln Glu Ser Gly Pro Gly Leu Val Lys Pro Ser Glu
1               5                   10                  15

Thr Leu Ser Leu Thr Cys Thr Val Ser Gly Gly Ser Ile Ser Ser Tyr
            20                  25                  30

Tyr Trp Ser Trp Ile Arg Gln Pro Pro Gly Lys Gly Leu Glu Trp Ile
        35                  40                  45

Gly Tyr Ile Tyr Tyr Ser Gly Ser Thr Asn Tyr Asn Pro Ser Leu Lys
    50                  55                  60

Ser Arg Val Thr Ile Ser Val Asp Thr Ser Lys Asn Gln Phe Ser Leu
65                  70                  75                  80

Lys Leu Ser Ser Val Thr Ala Ala Asp Thr Ala Val Tyr Tyr Cys Ala
                85                  90                  95

Arg



<210> 21
<211> 101
<212> PRT
<213> Homo sapiens

<400> 21

Gln Val Gln Leu Gln Gln Ser Gly Pro Gly Leu Val Lys Pro Ser Gln
1               5                   10                  15

Thr Leu Ser Leu Thr Cys Ala Ile Ser Gly Asp Ser Val Ser Ser Asn
            20                  25                  30

Ser Ala Ala Trp Asn Trp Ile Arg Gln Ser Pro Ser Arg Gly Leu Glu
        35                  40                  45

Trp Leu Gly Arg Thr Tyr Tyr Arg Ser Lys Trp Tyr Asn Asp Tyr Ala
    50                  55                  60

Val Ser Val Lys Ser Arg Ile Thr Ile Asn Pro Asp Thr Ser Lys Asn
65                  70                  75                  80

Gln Phe Ser Leu Gln Leu Asn Ser Val Thr Pro Glu Asp Thr Ala Val
                85                  90                  95

Tyr Tyr Cys Ala Arg
            100

<210> 22
<211> 95
<212> PRT
<213> Homo sapiens

<400> 22

Asp Ile Gln Met Thr Gln Ser Pro Ser Ser Leu Ser Ala Ser Val Gly
1               5                   10                  15

Asp Arg Val Thr Ile Thr Cys Arg Ala Ser Gln Ser Ile Ser Ser Tyr
            20                  25                  30

Leu Asn Trp Tyr Gln Gln Lys Pro Gly Lys Ala Pro Lys Leu Leu Ile
        35                  40                  45

Tyr Ala Ala Ser Ser Leu Gln Ser Gly Val Pro Ser Arg Phe Ser Gly
    50                  55                  60

Ser Gly Ser Gly Thr Asp Phe Thr Leu Thr Ile Ser Ser Leu Gln Pro
65                  70                  75                  80

Glu Asp Phe Ala Thr Tyr Tyr Cys Gln Gln Ser Tyr Ser Thr Pro
                85                  90                  95



<210> 23 

<211> 74 

<212> PRT 

<213> Homo sapiens 
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<400> 23 



Pro Ala Thr Leu Ser Leu Ser Pro Gly Glu Arg Ala Thr Leu Ser Cys
1               5                   10                  15

Arg Ala Ser Gln Ser Val Ser Ser Ser Tyr Leu Ala Trp Tyr Gln Gln
            20                  25                  30

Lys Pro Gly Gln Ala Pro Arg Leu Leu Ile Tyr Gly Ala Ser Ser Arg
        35                  40                  45

Ala Thr Gly Ile Pro Ala Arg Phe Ser Gly Ser Gly Ser Gly Thr Asp
    50                  55                  60

Phe Thr Leu Thr Ile Ser Arg Leu Glu Pro
65                  70

<210> 24
<211> 100
<212> PRT
<213> Homo sapiens

<400> 24

Asp Ile Val Met Thr Gln Ser Pro Leu Ser Leu Pro Val Thr Pro Gly
1               5                   10                  15

Glu Pro Ala Ser Ile Ser Cys Arg Ser Ser Gln Ser Leu Leu His Ser
            20                  25                  30

Asn Gly Tyr Asn Tyr Leu Asp Trp Tyr Leu Gln Lys Pro Gly Gln Ser
        35                  40                  45

Pro Gln Leu Leu Ile Tyr Leu Gly Ser Asn Arg Ala Ser Gly Val Pro
    50                  55                  60

Asp Arg Phe Ser Gly Ser Gly Ser Gly Thr Asp Phe Thr Leu Lys Ile
65                  70                  75                  80

Ser Arg Val Glu Ala Glu Asp Val Gly Val Tyr Tyr Cys Met Gln Ala
                85                  90                  95

Leu Gln Thr Pro
            100
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<210> 25 

<211> 101
<212> PRT
<213> Homo sapiens

<400> 25

Asp Ile Val Met Thr Gln Ser Pro Asp Ser Leu Ala Val Ser Leu Gly
1               5                   10                  15

Glu Arg Ala Thr Ile Asn Cys Lys Ser Ser Gln Ser Val Leu Tyr Ser
            20                  25                  30

Ser Asn Asn Lys Asn Tyr Leu Ala Trp Tyr Gln Gln Lys Pro Gly Gln
        35                  40                  45

Pro Pro Lys Leu Leu Ile Tyr Trp Ala Ser Thr Arg Glu Ser Gly Val
    50                  55                  60

Pro Asp Arg Phe Ser Gly Ser Gly Ser Gly Thr Asp Phe Thr Leu Thr
65                  70                  75                  80

Ile Ser Ser Leu Gln Ala Glu Asp Val Ala Val Tyr Tyr Cys Gln Gln
                85                  90                  95

Tyr Tyr Ser Thr Pro
            100

<210> 26
<211> 89
<212> PRT
<213> Homo sapiens

<400> 26

Gln Ser Val Leu Thr Gln Pro Pro Ser Ala Ser Gly Thr Pro Gly Gln
1               5                   10                  15

Arg Val Thr Ile Ser Cys Ser Gly Ser Ser Ser Asn Ile Gly Ser Asn
            20                  25                  30

Tyr Val Tyr Trp Tyr Gln Gln Leu Pro Gly Thr Ala Pro Lys Leu Leu
        35                  40                  45

Ile Tyr Ser Asn Asn Gln Arg Pro Ser Gly Val Pro Asp Arg Phe Ser
    50                  55                  60

Gly Ser Lys Ser Gly Thr Ser Ala Ser Leu Ala Ile Ser Gly Leu Arg
65                  70                  75                  80

Ser Glu Asp Glu Ala Asp Tyr Tyr Cys
                85



<210> 27
<211> 90
<212> PRT
<213> Homo sapiens

<400> 27

Gln Ser Ala Leu Thr Gln Pro Ala Ser Val Ser Gly Ser Pro Gly Gln
1               5                   10                  15

Ser Ile Thr Ile Ser Cys Thr Gly Thr Ser Ser Asp Val Gly Ser Tyr
            20                  25                  30

Asn Leu Val Ser Trp Tyr Gln Gln His Pro Gly Lys Ala Pro Lys Leu
        35                  40                  45

Met Ile Tyr Glu Val Ser Lys Arg Pro Ser Gly Val Ser Asn Arg Phe
    50                  55                  60

Ser Gly Ser Lys Ser Gly Asn Thr Ala Ser Leu Thr Ile Ser Gly Leu
65                  70                  75                  80

Gln Ala Glu Asp Glu Ala Asp Tyr Tyr Cys
                85                  90



<210> 28
<211> 88
<212> PRT
<213> Homo sapiens

<400> 28

Ser Tyr Glu Leu Thr Gln Pro Pro Ser Val Ser Val Ser Pro Gly Gln
1               5                   10                  15

Thr Ala Ser Ile Thr Cys Ser Gly Asp Lys Leu Gly Asp Lys Tyr Ala
            20                  25                  30

Cys Trp Tyr Gln Gln Lys Pro Gly Gln Ser Pro Val Leu Val Ile Tyr
        35                  40                  45

Gln Asp Ser Lys Arg Pro Ser Gly Ile Pro Glu Arg Phe Ser Gly Ser
    50                  55                  60

Asn Ser Gly Asn Thr Ala Thr Leu Thr Ile Ser Gly Thr Gln Ala Met
65                  70                  75                  80

Asp Glu Ala Asp Tyr Tyr Cys Gln
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